SAMOS: A Neural MOS Prediction Model Leveraging Semantic Representations and Acoustic Features
Abstract
Assessing the naturalness of speech with mean opinion score (MOS) prediction models has positive implications for the automatic evaluation of speech synthesis systems. Early MOS prediction models took the raw waveform or amplitude spectrum of speech as input, whereas more advanced methods employed self-supervised-learning (SSL) based models to extract semantic representations from speech for MOS prediction. These methods utilize only limited aspects of speech information, which restricts prediction accuracy. Therefore, in this paper, we propose SAMOS, a MOS prediction model that leverages both Semantic and Acoustic information of the speech to be assessed. Specifically, the proposed SAMOS leverages a pretrained wav2vec2 to extract semantic representations and uses the feature extractor of a pretrained BiVocoder to extract acoustic features. These two types of features are then fed into the prediction network, which includes multi-task heads and an aggregation layer, to obtain the final MOS score. Experimental results demonstrate that the proposed SAMOS outperforms current state-of-the-art MOS prediction models on the BVCC dataset and achieves comparable performance on the BC2019 dataset, according to system-level evaluation metrics.
Index Terms: MOS prediction, speech quality assessment, semantic representation, acoustic feature
1 Introduction
Text-to-speech (TTS) synthesis and voice conversion (VC) are focal points in the field of speech research, where evaluating the quality of speech synthesized by TTS and VC systems involves both objective and subjective assessments. Common objective evaluation metrics, such as mel-cepstral distance (MCD) [1] and signal-to-noise ratio (SNR) [2], have limited correlation with human perception of speech quality. As a result, some objective measures or models related to human perception have been proposed [3, 4, 5, 6]. However, these objective methods typically require reference speech, making them impractical for evaluating synthesized speech signals. Therefore, for synthesis systems, subjective evaluation through listening tests is considered the gold standard, with mean opinion score (MOS) testing being a common practice. In MOS testing, each listener is asked to rate the naturalness of a speech sample on a scale from 1 to 5. Because MOS testing is time-consuming and expensive, developing automated MOS prediction methods is necessary.

Early MOS prediction models adopted bidirectional long short-term memory (BiLSTM) based recurrent neural networks (RNNs) and convolutional neural networks (CNNs) for single-sentence score prediction [7, 8, 9], with amplitude-related features as input. These works all took the average of the scores given by multiple listeners for a sentence as the target; however, different listeners actually provide different judgments for the same sentence. Recognizing this, MBNet [10] and LDNet [11] considered listener information and added the scores from each listener as input, achieving some progress in prediction accuracy. Recently, with the rise of self-supervised-learning (SSL) based models trained on large-scale unlabeled data, fine-tuning SSL models on MOS datasets and extracting SSL representations to incorporate their high-level semantic information into MOS prediction models has become a widely used approach, demonstrating impressive performance. Representative methods include SSL-MOS [12], which added a linear layer on top of an SSL model for fine-tuning on MOS datasets, and MOSA-Net [13], which used a pretrained HuBERT [14] to extract SSL features and fed them into a BiLSTM or CNN for MOS prediction. The VoiceMOS Challenge 2022 [15] was the first competition focused on MOS prediction tasks, where top-ranking systems mostly extended SSL-MOS. Participants made several improvements to the baseline model, e.g., adding listener information embeddings [16, 17, 18], incorporating additional phoneme sequences extracted from ASR models [18], and adopting multi-task learning [16] or ensemble learning [18, 19, 20]. After this challenge, further improvements have gradually been proposed, such as integrating k-nearest neighbor (KNN) classification into SSL-based MOS prediction methods [21] or introducing prosodic and linguistic features [22] as additional inputs to the model.
However, most state-of-the-art models only utilize semantic features from SSL models or use a single-task framework that predicts frame-level scores with equal weights, limiting the accuracy of MOS prediction. Therefore, we introduce a novel neural MOS prediction model called SAMOS, which fully utilizes both semantic and acoustic information in speech. Semantic representations extracted from a pretrained wav2vec2 [23], acoustic features extracted by a pretrained BiVocoder [24], and listener-labeled information are jointly used as inputs to the model. The acoustic features contain compressed amplitude and phase details, which provide more comprehensive information than the amplitude-only features commonly used in previous methods [7, 8, 9]. To further boost MOS prediction accuracy, SAMOS also incorporates several techniques into its prediction network, such as multi-task learning heads [17], a weight branch [25], and an aggregation layer. Without using a model ensemble strategy, the proposed SAMOS achieves state-of-the-art performance on the main-track BVCC dataset [26] of the VoiceMOS Challenge 2022 with a single neural network model. Additionally, SAMOS achieves performance comparable to other MOS prediction methods on the out-of-domain BC2019 dataset [27].
2 Proposed Method
The model structure of SAMOS is illustrated in Figure 1. First, the feature extractor produces three types of features, i.e., semantic representations derived from a wav2vec2-based semantic module [23], acoustic features extracted by an acoustic module composed of the feature extractor of BiVocoder [24] and a Conformer [28], and listener-labeled information generated from the listener ID through a learnable embedding. We concatenate these three types of features along the feature dimension and feed them into the base MOS predictor. The base MOS predictor consists of a BiLSTM network, followed by parallel classification and regression heads, which respectively output the probability distribution of score classification $\boldsymbol{p}$ and the regression score $\hat{s}_{\mathrm{reg}}$. We further take the expectation of the probability distribution as the classification score $\hat{s}_{\mathrm{cls}}$, and feed both $\hat{s}_{\mathrm{reg}}$ and $\hat{s}_{\mathrm{cls}}$ into an aggregation layer to output the final MOS score $\hat{s}$.
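To make the data flow concrete, below is a minimal PyTorch sketch of the base MOS predictor operating on pre-extracted features. It is a sketch under stated assumptions, not the released implementation: the class and tensor names are ours, the feature dimensions follow Section 3.3, and the softmax normalization of the frame weights is an assumed detail.

```python
import torch
import torch.nn as nn

class BaseMOSPredictor(nn.Module):
    """Sketch of the base MOS predictor: BiLSTM + regression/classification heads + aggregation."""
    def __init__(self, in_dim, hidden=128, num_bins=5):
        super().__init__()
        self.blstm = nn.LSTM(in_dim, hidden, num_layers=3,
                             batch_first=True, bidirectional=True)
        # Regression head: per-frame score branch and per-frame weight branch.
        self.frame_score = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.frame_weight = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        # Classification head: sentence-level distribution over the 5 score bins.
        self.cls_head = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, num_bins))
        # Aggregation layer: linearly combines the regression and classification scores.
        self.agg = nn.Linear(2, 1)

    def forward(self, feats):                          # feats: (B, T, in_dim)
        h, _ = self.blstm(feats)                       # (B, T, 2*hidden)
        scores = self.frame_score(h).squeeze(-1)       # (B, T) frame-level scores
        weights = torch.softmax(self.frame_weight(h).squeeze(-1), dim=-1)
        s_reg = (scores * weights).sum(dim=-1)         # weighted average over frames
        p = torch.softmax(self.cls_head(h).mean(dim=1), dim=-1)       # (B, 5)
        bins = torch.arange(1, 6, dtype=p.dtype, device=p.device)
        s_cls = (p * bins).sum(dim=-1)                 # expectation of the distribution
        s_final = self.agg(torch.stack([s_reg, s_cls], dim=-1)).squeeze(-1)
        return s_reg, s_cls, s_final, p

# Placeholder inputs: semantic (64-d), acoustic (768-d), and listener (128-d) features.
B, T = 2, 200
semantic, acoustic = torch.randn(B, T, 64), torch.randn(B, T, 768)
listener = torch.randn(B, T, 128)
model = BaseMOSPredictor(in_dim=64 + 768 + 128)
s_reg, s_cls, s_final, p = model(torch.cat([semantic, acoustic, listener], dim=-1))
```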
We employ a stage-wise training mode for SAMOS, as illustrated in Figure 2; a schematic parameter-freezing sketch is given after the list below.
- Training stage 1: We first freeze the parameters of the pre-trained BiVocoder and the classification head, and train the other components in the feature extractor and the base MOS predictor to obtain the score $\hat{s}_{\mathrm{reg}}$ outputted by the regression head.
- Training stage 2: Then, we swap the frozen and trained modules (BiVocoder remains frozen) to obtain the score $\hat{s}_{\mathrm{cls}}$ outputted by the classification head.
- Training stage 3: Finally, we freeze the whole feature extractor and base MOS predictor, and introduce the aggregation layer, which is trained separately to output the final score $\hat{s}$.
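A schematic of the three-stage parameter-freezing procedure is sketched below, assuming the model exposes the listed sub-modules as attributes; the attribute names are hypothetical and only illustrate which parts are trainable in each stage.

```python
import torch

def set_trainable(modules, flag):
    """Enable or disable gradients for all parameters of the given modules."""
    for m in modules:
        for p in m.parameters():
            p.requires_grad_(flag)

def configure_stage(model, stage):
    """Freeze/unfreeze sub-modules according to the training stage (1, 2, or 3)."""
    everything = [model.bivocoder, model.semantic_module, model.conformer,
                  model.id_embedding, model.blstm,
                  model.regression_head, model.classification_head, model.agg_layer]
    set_trainable(everything, False)                  # start from a fully frozen model
    if stage == 1:   # BiVocoder and classification head stay frozen; train the rest
        set_trainable([model.semantic_module, model.conformer, model.id_embedding,
                       model.blstm, model.regression_head], True)
    elif stage == 2:  # swap: only the classification head is trained (BiVocoder stays frozen)
        set_trainable([model.classification_head], True)
    elif stage == 3:  # only the aggregation layer is trained
        set_trainable([model.agg_layer], True)
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.SGD(trainable, lr=1e-4)        # SGD with learning rate 0.0001 (Sec. 3.3)
```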
2.1 Feature extractor
2.1.1 Semantic module and listener ID embedding
Following previous research settings [12], we utilize a pre-trained SSL model (wav2vec2) to extract frame-level semantic representations from the raw waveform. Different from SSL-MOS [12], SAMOS processes the frame-level semantic representations through subsequent networks. To fully leverage the ratings given by each listener for a single sentence, inspired by [11], we assign IDs to all raters and feed them into a learnable embedding to obtain listener features. It is worth noting that for a single sample $x$, there are $N$ individual scores given by $N$ listeners and the average score $\bar{s}$ across all listeners. Each individual score corresponds to the ID of its rater. We assign a virtual "mean-listener" ID to $\bar{s}$, so during training, a sample is trained $N+1$ times. During inference, since the rater of the sample is unknown, we use the mean-listener ID as input.
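A minimal sketch of the listener-ID embedding with the virtual mean-listener is given below; the ID layout (real listener IDs followed by one extra mean-listener ID) and the 128-dimensional embedding size are assumptions consistent with Section 3.3.

```python
import torch
import torch.nn as nn

num_listeners = 8                       # e.g., 8 raters per BVCC utterance
mean_listener_id = num_listeners        # one extra virtual ID appended after the real ones
id_embedding = nn.Embedding(num_listeners + 1, 128)   # 128-d listener feature

# Training: a sample is seen N+1 times, once per individual rater (paired with that
# rater's score) and once with the mean-listener ID (paired with the averaged score).
num_frames = 200
for listener_id in list(range(num_listeners)) + [mean_listener_id]:
    ids = torch.full((1,), listener_id, dtype=torch.long)
    listener_feat = id_embedding(ids).unsqueeze(1).expand(-1, num_frames, -1)  # (1, T, 128)

# Inference: the rater is unknown, so the mean-listener ID is always used.
infer_feat = id_embedding(torch.tensor([mean_listener_id]))
```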

Table 1: System-level evaluation results of the proposed SAMOS and the baseline methods on the BVCC and BC2019 test sets.

| Model | BVCC | | | | BC2019 | | | |
|---|---|---|---|---|---|---|---|---|
| | S_MSE | S_LCC | S_SRCC | S_KTAU | S_MSE | S_LCC | S_SRCC | S_KTAU |
| UTMOS (Y) | 0.090 | 0.939 | 0.936 | 0.794 | 0.030 | 0.988 | 0.979 | 0.908 |
| T11 (Y) | 0.101 | 0.941 | 0.939 | 0.797 | 0.048 | 0.982 | 0.952 | 0.852 |
| SSL-MOS | 0.113 | 0.928 | 0.923 | 0.770 | 0.093 | 0.971 | 0.975 | 0.889 |
| UTMOS strong | 0.148 | 0.930 | 0.925 | 0.774 | 0.248 | 0.970 | 0.972 | 0.879 |
| DDOS | 0.091 | 0.940 | 0.938 | 0.792 | 0.070 | 0.960 | 0.969 | 0.871 |
| SAMOS | 0.097 | 0.944 | 0.942 | 0.797 | 0.179 | 0.968 | 0.976 | 0.895 |
2.1.2 Acoustic module
The acoustic module consists of a BiVocoder and a Conformer. BiVocoder, our previous work [24], is a newly proposed bidirectional neural vocoder with both feature extraction and waveform generation capabilities. In its feature extraction module, the speech amplitude and phase spectra are separately passed through a ConvNeXt v2 network [29] and a downsampling layer, followed by a concatenation operation and a dimension-reduction layer, yielding a compressed low-dimensional continuous feature containing both amplitude and phase information. This feature is then fed into the mirrored waveform generation module for speech reconstruction. In our work, we only use the feature extraction module of BiVocoder to extract acoustic features. To further capture global information in the acoustic features, we feed the output of the BiVocoder feature extraction module into a Conformer to obtain the final acoustic features.
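The sketch below only illustrates the shape of this pipeline: plain linear layers stand in for the ConvNeXt v2 blocks and downsampling of BiVocoder, and a vanilla Transformer encoder stands in for the Conformer. All layer choices and dimensions are illustrative stand-ins, not the actual BiVocoder architecture.

```python
import torch
import torch.nn as nn

n_fft, hop = 1024, 256
wav = torch.randn(1, 16000)                              # placeholder waveform
spec = torch.stft(wav, n_fft=n_fft, hop_length=hop,
                  window=torch.hann_window(n_fft), return_complex=True)
amp = spec.abs().transpose(1, 2)                         # (B, T, F) amplitude spectrum
phase = spec.angle().transpose(1, 2)                     # (B, T, F) phase spectrum

freq_bins = n_fft // 2 + 1
amp_enc = nn.Linear(freq_bins, 256)      # stand-in for ConvNeXt v2 + downsampling (amplitude)
phase_enc = nn.Linear(freq_bins, 256)    # stand-in for ConvNeXt v2 + downsampling (phase)
reduce = nn.Linear(512, 64)              # dimension reduction to a compact continuous feature

compressed = reduce(torch.cat([amp_enc(amp), phase_enc(phase)], dim=-1))   # (B, T, 64)

# A Transformer encoder layer stands in for the 1-layer Conformer that captures
# global information in the acoustic features.
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
conformer_like = nn.TransformerEncoder(layer, num_layers=1)
acoustic_feat = conformer_like(compressed)               # final acoustic features
```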
2.2 Base MOS predictor
The frame-shift settings of wav2vec2 and BiVocoder are the same, so the numbers of frames for the semantic and acoustic features are identical. Listener features are repeated along the time axis to match the same number of frames. These three features are concatenated along the feature dimension and fed into the base MOS predictor for MOS score prediction. Inspired by [17], we first pass the concatenated features through a BiLSTM and then generate two scores separately by the classification and regression heads.
2.2.1 Regression head
Several works [18, 19] consider MOS prediction as a regression task, where the model directly outputs the MOS score for each frame. When aggregating the MOS score for a sentence, directly averaging the frame-level scores is not reasonable, because speech quality is not uniform across all frames. Therefore, we adopt a two-branch structure [25], where one branch outputs scores for each frame and the other branch outputs corresponding weights for each frame. The final sentence-level regression score $\hat{s}_{\mathrm{reg}}$ is obtained by performing a weighted average over the scores of all frames. We use the contrastive loss [18] and the clipped loss [10] to train the regression head. Given two sentences $x_i$ and $x_j$, the contrastive loss is $\mathcal{L}_{\mathrm{con}} = \max(0, |\Delta s - \Delta\hat{s}| - \alpha)$, where $\Delta s$ and $\Delta\hat{s}$ respectively represent the differences of the labeled and predicted scores between these two sentences, and $\alpha$ is a hyperparameter. The contrastive loss aims to penalize cases where the ranking between the predicted scores of $x_i$ and $x_j$ differs from that of their actual labels. The clipped loss is defined as $\mathcal{L}_{\mathrm{clip}} = \mathbb{1}(|s - \hat{s}_{\mathrm{reg}}| > \tau)\,(s - \hat{s}_{\mathrm{reg}})^2$, which is expected to mitigate overfitting issues, where $s$ and $\hat{s}_{\mathrm{reg}}$ are the labeled and predicted scores of a sentence, $\mathbb{1}(\cdot)$ is the indicator function, and $\tau$ is a hyperparameter. The overall loss function is expressed as $\mathcal{L}_{\mathrm{reg}} = \lambda_1 \mathcal{L}_{\mathrm{con}} + \lambda_2 \mathcal{L}_{\mathrm{clip}}$, where $\lambda_1$ and $\lambda_2$ are hyperparameters.
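The two loss terms can be written compactly as in the sketch below. The hyperparameter values here are placeholders rather than the ones used in the paper, and pairing each sentence with a rolled copy of the batch is our own simplification for forming contrastive pairs.

```python
import torch

def contrastive_loss(s_i, s_j, s_hat_i, s_hat_j, alpha=0.1):
    """max(0, |Δs - Δŝ| - α): penalizes pairs whose predicted score difference
    deviates from the labeled difference by more than the margin α."""
    delta_label = s_i - s_j
    delta_pred = s_hat_i - s_hat_j
    return torch.clamp((delta_label - delta_pred).abs() - alpha, min=0.0).mean()

def clipped_mse_loss(s, s_hat, tau=0.25):
    """1(|s - ŝ| > τ)(s - ŝ)^2: errors smaller than τ are ignored."""
    err = s - s_hat
    return ((err.abs() > tau).float() * err ** 2).mean()

def regression_loss(s, s_hat, lambda_con=1.0, lambda_clip=1.0):
    # Pair each sentence with a rolled copy of the batch to form contrastive pairs.
    s_j, s_hat_j = s.roll(1, dims=0), s_hat.roll(1, dims=0)
    return (lambda_con * contrastive_loss(s, s_j, s_hat, s_hat_j)
            + lambda_clip * clipped_mse_loss(s, s_hat))

labels = torch.tensor([3.2, 4.1, 2.5])
preds = torch.tensor([3.0, 4.4, 2.9], requires_grad=True)
loss = regression_loss(labels, preds)
```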
2.2.2 Classification head
MOS prediction can also be viewed as a classification task, i.e., predicting the probability of the score falling into each score range. Thus, we also introduce a classification head. The input is first processed through multiple linear layers, then averaged over all frames, and finally passed through a softmax layer to output a sentence-level probability vector of length 5. Denoting the vector as $\boldsymbol{p} = [p_1, \dots, p_5]$, the classification score is the expectation $\hat{s}_{\mathrm{cls}} = \sum_{k=1}^{5} k\, p_k$. When the rater ID is not the mean-listener, the label of the sample is the score $k$ given by the individual rater (an integer from 1 to 5). In this case, the target is a one-hot vector of length 5 with the $k$-th element being 1 and all other elements being 0. When the ID is the mean-listener, the label is $\frac{1}{N}\sum_{i=1}^{N}\boldsymbol{o}_i$, where $\boldsymbol{o}_i$ is the one-hot vector representation of the rating given by the $i$-th listener and $N$ is the number of listeners. We use the cross-entropy loss to train the classification head after finishing the training of the regression head.
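A small sketch of the target construction and the expected-score computation is given below, assuming 5 score bins; the function names are ours and the soft-target cross-entropy is written out explicitly.

```python
import torch
import torch.nn.functional as F

NUM_BINS = 5   # score bins 1..5

def expected_score(prob):
    """Expectation of the predicted score distribution (prob: (B, 5) softmax output)."""
    bins = torch.arange(1, NUM_BINS + 1, dtype=prob.dtype)
    return (prob * bins).sum(dim=-1)

def target_distribution(ratings=None, individual=None):
    """Individual rater: one-hot of the integer score; mean-listener: average of
    the one-hot vectors of all listeners' ratings."""
    if individual is not None:
        return F.one_hot(torch.tensor(individual - 1), NUM_BINS).float()
    one_hots = F.one_hot(torch.tensor([r - 1 for r in ratings]), NUM_BINS).float()
    return one_hots.mean(dim=0)

probs = torch.softmax(torch.randn(1, NUM_BINS), dim=-1)          # placeholder head output
s_cls = expected_score(probs)                                    # classification score
target = target_distribution(ratings=[4, 4, 3, 5, 4, 3, 4, 5])   # mean-listener target
ce_loss = -(target * probs.clamp_min(1e-8).log()).sum()          # cross-entropy with soft target
```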
2.3 Aggregation layer
Previous work [30] has indicated that the scores outputted by the classification head and the regression head perform inconsistently in different score ranges. Therefore, simple addition or averaging is not ideal. Thus, we add a linear layer to aggregate the scores $\hat{s}_{\mathrm{reg}}$ and $\hat{s}_{\mathrm{cls}}$ into a final score $\hat{s}$, enabling the layer to autonomously learn the relationship between these two scores. The training criterion is to minimize the mean square error (MSE) between the score outputted by the aggregation layer and the labeled one.
Table 2: Ablation results of SAMOS on the BVCC and BC2019 test sets, where "-" denotes removing the corresponding component.

| Model | BVCC | | | | BC2019 | | | |
|---|---|---|---|---|---|---|---|---|
| | S_MSE | S_LCC | S_SRCC | S_KTAU | S_MSE | S_LCC | S_SRCC | S_KTAU |
| SAMOS | 0.097 | 0.944 | 0.942 | 0.797 | 0.179 | 0.968 | 0.976 | 0.895 |
| - semantic module | 0.270 | 0.822 | 0.822 | 0.636 | 0.511 | 0.798 | 0.848 | 0.662 |
| - acoustic module | 0.088 | 0.935 | 0.931 | 0.783 | 0.223 | 0.962 | 0.960 | 0.846 |
| - ID embedding | 0.114 | 0.928 | 0.929 | 0.780 | 0.220 | 0.966 | 0.971 | 0.871 |
| - weight branch | 0.096 | 0.943 | 0.937 | 0.791 | 0.263 | 0.962 | 0.983 | 0.908 |
| - regression head | 0.136 | 0.925 | 0.926 | 0.773 | 0.355 | 0.959 | 0.962 | 0.790 |
| - classification head | 0.116 | 0.944 | 0.942 | 0.796 | 0.207 | 0.969 | 0.975 | 0.889 |
| - aggregation layer | 0.102 | 0.944 | 0.942 | 0.797 | 0.250 | 0.963 | 0.975 | 0.895 |
3 Experiment Setup
3.1 Dataset
In this paper, the experiments followed the same settings as the VoiceMOS Challenge 2022 [15]. The datasets include the BVCC dataset [12] from the main track and the BC2019 dataset [27] from the out-of-domain (OOD) track. The BVCC dataset contains 7,106 English utterances, with the training/development/test sets split in a ratio of 70/15/15. The data comes from systems submitted to past Blizzard Challenges (BCs) and Voice Conversion Challenges (VCCs), as well as samples generated by ESPNet [31]. Each sentence in BVCC has scores from 8 listeners, which we utilized along with the original average score. The BC2019 dataset consists of Mandarin utterances from the BC 2019, with 136, 136, and 540 utterances forming the training, development, and test sets, respectively. Each sentence in BC2019 is scored by 10 to 17 listeners. Since the raters in the BVCC and BC2019 datasets are different, we treated all listeners as the mean-listener by default when fine-tuning on BC2019.
3.2 Evaluation metrics
We used MSE, linear correlation coefficient (LCC), Spearman rank correlation coefficient (SRCC), and Kendall rank correlation coefficient (KTAU) to evaluate MOS prediction models. These metrics focus on different aspects. In challenges like BC or VCC, we are more interested in the ranking of participating systems, hence metrics like SRCC and KTAU are more important, whereas MSE is more reasonable when evaluating individual synthesis systems. In the VoiceMOS Challenge 2022 [15], the organizers used system-level SRCC to determine rankings. Considering that automatic MOS prediction models are mostly used to compare the quality differences of generated speech between systems, we used system-level metrics (indicated by the prefix S_) as the standard for evaluating different MOS prediction models in the experiments.
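A sketch of the system-level metric computation is shown below: per-utterance MOS values are first averaged within each system, and the four metrics are then computed over the system means. The grouping and scores are illustrative.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

def system_level_metrics(true_mos, pred_mos, system_ids):
    """Average utterance-level scores per system, then compute S_MSE/S_LCC/S_SRCC/S_KTAU."""
    systems = sorted(set(system_ids))
    t = np.array([np.mean([m for m, s in zip(true_mos, system_ids) if s == sid]) for sid in systems])
    p = np.array([np.mean([m for m, s in zip(pred_mos, system_ids) if s == sid]) for sid in systems])
    return {"S_MSE": float(np.mean((t - p) ** 2)),
            "S_LCC": pearsonr(t, p)[0],
            "S_SRCC": spearmanr(t, p)[0],
            "S_KTAU": kendalltau(t, p)[0]}

metrics = system_level_metrics(
    true_mos=[3.1, 3.4, 4.2, 4.0, 2.5, 2.7],
    pred_mos=[3.0, 3.5, 4.1, 4.3, 2.6, 2.4],
    system_ids=["A", "A", "B", "B", "C", "C"])
```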
3.3 Implementation
For the feature extractor, we used a BiVocoder pre-trained on the VCTK-0.92 corpus [32] and a 1-layer Conformer in the acoustic module. The pre-trained wav2vec2.0 from fairseq constituted the semantic module. The output feature dimensions of the semantic and acoustic modules were 64 and 768, respectively. The ID embedding dimension was set to 128. For the base MOS predictor, 3 BiLSTM layers with 128 nodes were used. In the regression head, both the frame-level score prediction and the weight prediction used 2 linear layers, and the loss hyperparameters $\alpha$, $\tau$, $\lambda_1$, and $\lambda_2$ were set to fixed values. In the classification head, 2 linear layers were adopted. During training on BVCC, we trained for 1000 epochs with a batch size of 8, using stochastic gradient descent (SGD) with a learning rate of 0.0001. For the checkpoint saving strategy, we followed the same approach as UTMOS [18], selecting the checkpoint with the best system-level SRCC calculated on the development set. If the system-level SRCC did not improve within 15 epochs, early stopping was applied. Since calculating metrics based on a single checkpoint may involve some randomness, we chose to average the parameters of the three best checkpoints to obtain a new model for testing. Due to the limited training data in the OOD track, we fine-tuned the SAMOS model trained on BVCC using the BC2019 dataset.
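The checkpoint-averaging step can be implemented as in the sketch below; the file names are placeholders and the averaging is a plain element-wise mean of the saved state dictionaries.

```python
import torch

def average_checkpoints(paths):
    """Average the parameters of several saved checkpoints into one state dict."""
    state_dicts = [torch.load(p, map_location="cpu") for p in paths]
    averaged = {}
    for key in state_dicts[0]:
        stacked = torch.stack([sd[key].float() for sd in state_dicts], dim=0)
        averaged[key] = stacked.mean(dim=0)
    return averaged

# Usage (hypothetical file names of the three best development-set SRCC checkpoints):
# model.load_state_dict(average_checkpoints(["best1.pt", "best2.pt", "best3.pt"]))
```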
3.4 Baselines
We adopted the baseline SSL-MOS from the VoiceMOS Challenge 2022 and several top-performing participating systems as baselines in the experiments. Specifically, UTMOS used ensemble learning, integrating the neural-network-based UTMOS strong with several machine learning methods. UTMOS ranked 1st, 2nd, 3rd, and 3rd in the four system-level metrics in the main track, and 1st in all four system-level metrics in the OOD track. Currently, UTMOS strong, without a phoneme encoder, is widely used for automatic MOS scoring of synthesis systems. Therefore, we separately considered UTMOS strong as an additional baseline for comparison. DDOS added a classification head on top of SSL-MOS, ranking 2nd in three of the four system-level metrics in the main track and 3rd in the remaining one. However, DDOS performed poorly in the OOD track. Team T11 integrated multiple SSL models and ASR models, ranking 1st in three system-level metrics in the main track and 1st in two system-level metrics in the OOD track.
4 Results and Analysis
4.1 Comparison with baseline methods
We first compare the proposed SAMOS with the baselines. As shown in Table 1, the experimental results on the BVCC dataset indicate that the proposed SAMOS significantly outperformed the baseline models on the three system-level metrics emphasizing correlation. Compared to SSL-MOS, it also reduced the system-level MSE. On the BC2019 dataset, SAMOS ranked 2nd in the two metrics emphasizing the correctness of system rankings, just behind UTMOS. However, UTMOS adopted an ensemble strategy, integrating over a hundred learners. This method requires extensive computational resources and is not conducive to practical application. When no ensemble strategy is used, SAMOS surpassed the single-neural-network-based UTMOS strong on three metrics. Therefore, our proposed SAMOS also demonstrates performance comparable to the baselines on OOD data, confirming the robustness and stability of our proposed method.
4.2 Ablation studies
We then conduct ablation experiments on SAMOS to explore the role of each component. The results are shown in Table 2. We first investigate the contributions of the semantic information, acoustic information, and listener ID information to the overall model performance. We can see that removing the semantic module degraded all the metrics on both datasets, indicating the importance of the semantic representations from the SSL model. When we removed the acoustic module, the results showed that the acoustic features extracted by BiVocoder were also indispensable. Removing the ID embedding, although causing a performance drop, still outperformed the original baseline SSL-MOS on the BVCC dataset. This indicates that acoustic features can effectively compensate for the performance loss caused by the absence of ID information. The weight branch in the regression head was also proven effective, as hypothesized in Section 2.2.1. Subsequently, we ablated the entire regression head, which degraded all metrics to some extent. This indicates that treating MOS prediction purely as a classification task is challenging. Finally, removing the classification head and removing the aggregation layer both led to an increase in MSE, indicating that the multi-task head framework had lower error compared to a single regression head. Using an aggregation layer instead of directly adding the scores from the two heads also improved prediction accuracy.
5 Conclusions
This paper presents a novel MOS prediction model called SAMOS, which simultaneously utilizes semantic and acoustic information as input. To improve prediction accuracy, SAMOS employs parallel regression and classification heads and outputs the final MOS score through an aggregation layer. Experimental results confirm that our proposed SAMOS significantly outperforms the other baselines on system-level metrics, especially on the main-track English dataset. Applying SAMOS to actual speech generation systems and guiding their training will be our future work.
References
- [1] R. Kubichek, “Mel-cepstral distance measure for objective speech quality assessment,” in Proc. PACRIM, 1993, pp. 125–128.
- [2] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR–half-baked or well done?” in Proc. ICASSP, 2019, pp. 626–630.
- [3] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs,” in Proc. ICASSP, 2001, pp. 749–752.
- [4] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “A short-time objective intelligibility measure for time-frequency weighted noisy speech,” in Proc. ICASSP, 2010, pp. 4214–4217.
- [5] A. Hines, J. Skoglund, A. C. Kokaram, and N. Harte, “ViSQOL: An objective speech quality model,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2015, pp. 1–18, 2015.
- [6] G. Mittag, B. Naderi, A. Chehadi, and S. Möller, “NISQA: A deep CNN-self-attention model for multidimensional speech quality prediction with crowdsourced datasets,” in Proc. Interspeech, 2021, pp. 2127–2131.
- [7] B. Patton, Y. Agiomyrgiannakis, M. Terry, K. Wilson, R. A. Saurous, and D. Sculley, “AutoMOS: Learning a non-intrusive assessor of naturalness-of-speech,” arXiv preprint arXiv:1611.09207, 2016.
- [8] S.-W. Fu, Y. Tsao, H.-T. Hwang, and H.-M. Wang, “Quality-Net: An end-to-end non-intrusive speech quality assessment model based on BLSTM,” in Proc. Interspeech, 2018, pp. 1873–1877.
- [9] C.-C. Lo, S.-W. Fu, W.-C. Huang, X. Wang, J. Yamagishi, Y. Tsao, and H.-M. Wang, “MOSNet: Deep learning-based objective assessment for voice conversion,” in Proc. Interspeech, 2019, pp. 1541–1545.
- [10] Y. Leng, X. Tan, S. Zhao, F. Soong, X.-Y. Li, and T. Qin, “MBNet: MOS prediction for synthesized speech with mean-bias network,” in Proc. ICASSP, 2021, pp. 391–395.
- [11] W.-C. Huang, E. Cooper, J. Yamagishi, and T. Toda, “LDNet: Unified listener dependent modeling in MOS prediction for synthetic speech,” in Proc. ICASSP, 2022, pp. 896–900.
- [12] E. Cooper, W.-C. Huang, T. Toda, and J. Yamagishi, “Generalization ability of MOS prediction networks,” in Proc. ICASSP, 2022, pp. 8442–8446.
- [13] R. E. Zezario, S.-W. Fu, F. Chen, C.-S. Fuh, H.-M. Wang, and Y. Tsao, “Deep learning-based non-intrusive multi-objective speech assessment model with cross-domain features,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 54–70, 2022.
- [14] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.
- [15] W.-C. Huang, E. Cooper, Y. Tsao, H.-M. Wang, T. Toda, and J. Yamagishi, “The VoiceMOS Challenge 2022,” in Proc. Interspeech, 2022, pp. 4536–4540.
- [16] X. Tian, K. Fu, S. Gao, Y. Gu, K. Wang, W. Li, and Z. Ma, “A transfer and multi-task learning based approach for MOS prediction,” in Proc. Interspeech, 2022, pp. 5438–5442.
- [17] W.-C. Tseng, W.-T. Kao, and H.-y. Lee, “DDOS: A MOS prediction framework utilizing domain adaptive pre-training and distribution of opinion scores,” in Proc. Interspeech, 2022, pp. 4541–4545.
- [18] T. Saeki, D. Xin, W. Nakata, T. Koriyama, S. Takamichi, and H. Saruwatari, “UTMOS: UTokyo-SaruLab system for VoiceMOS Challenge 2022,” in Proc. Interspeech, 2022, pp. 4521–4525.
- [19] Z. Yang, W. Zhou, C. Chu, S. Li, R. Dabre, R. Rubino, and Y. Zhao, “Fusion of self-supervised learned models for MOS prediction,” in Proc. Interspeech, 2022, pp. 5443–5447.
- [20] M. Kunešová, J. Matoušek, J. Lehečka, J. Švec, J. Michálek, D. Tihelka, M. Bulín, Z. Hanzlíček, and M. Řezáčková, “Ensemble of deep neural network models for MOS prediction,” in Proc. ICASSP, 2023, pp. 1–5.
- [21] H. Wang, S. Zhao, X. Zheng, and Y. Qin, “RAMP: Retrieval-augmented MOS prediction via confidence-based dynamic weighting,” in Proc. Interspeech, 2023, pp. 1095–1099.
- [22] A. Vioni, G. Maniati, N. Ellinas, J. S. Sung, I. Hwang, A. Chalamandaris, and P. Tsiakoulis, “Investigating content-aware neural text-to-speech MOS prediction using prosodic and linguistic features,” in Proc. ICASSP, 2023, pp. 1–5.
- [23] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “Wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in Neural Information Processing Systems, vol. 33, pp. 12449–12460, 2020.
- [24] H.-P. Du, Y.-X. Lu, Y. Ai, and Z.-H. Ling, “BiVocoder: A bidirectional neural vocoder integrating feature extraction and waveform generation,” arXiv preprint arXiv:2406.02162, 2024.
- [25] K. Shen, D. Yan, L. Dong, Y. Ren, X. Wu, and J. Hu, “SQAT-LD: Speech quality assessment transformer utilizing listener dependent modeling for zero-shot out-of-domain MOS prediction,” in Proc. ASRU, 2023, pp. 1–6.
- [26] E. Cooper and J. Yamagishi, “How do voices from past speech synthesis challenges compare today?” arXiv preprint arXiv:2105.02373, 2021.
- [27] Z. Wu, Z. Xie, and S. King, “The Blizzard Challenge 2019,” in Proc. Blizzard Challenge Workshop, 2019.
- [28] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu et al., “Conformer: Convolution-augmented transformer for speech recognition,” in Proc. Interspeech, 2020, pp. 5036–5040.
- [29] S. Woo, S. Debnath, R. Hu, X. Chen, Z. Liu, I. S. Kweon, and S. Xie, “ConvNeXt V2: Co-designing and scaling convnets with masked autoencoders,” in Proc. CVPR, 2023, pp. 16133–16142.
- [30] B. Gyires-Tóth and C. Zainkó, “Improving self-supervised learning-based MOS prediction networks,” arXiv preprint arXiv:2204.11030, 2022.
- [31] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. E. Y. Soplin, J. Heymann, M. Wiesner, N. Chen et al., “ESPNet: End-to-end speech processing toolkit,” in Proc. Interspeech, 2018, pp. 2207–2211.
- [32] J. Yamagishi, C. Veaux, K. MacDonald et al., “CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92),” University of Edinburgh. The Centre for Speech Technology Research (CSTR), 2019.