
Two-stage training method for Japanese electrolaryngeal speech enhancement based on sequence-to-sequence voice conversion

Abstract

Sequence-to-sequence (seq2seq) voice conversion (VC) models have greater potential for converting electrolaryngeal (EL) speech to normal speech (EL2SP) than conventional VC models. However, seq2seq-based EL2SP requires a sufficiently large parallel dataset for model training and suffers from significant performance degradation when the training data are insufficient. To address this issue, we propose a novel two-stage strategy to optimize seq2seq-based EL2SP when only a small parallel dataset is available. In contrast to previous studies that rely on high-quality data augmentation, we first pool a large amount of imperfect synthetic parallel data of EL and normal speech with the original dataset for VC training. A second training stage is then conducted with the original parallel dataset only. The results show that the proposed method progressively improves the performance of seq2seq-based EL2SP.

Index Terms—  sequence-to-sequence voice conversion, electrolaryngeal speech to normal speech, synthetic parallel data, two-stage training

Fig. 1: The overall system for improving seq2seq-based EL2SP enhancement.

1 Introduction

People who suffer from laryngeal cancer or accidents may have to undergo a laryngectomy to partially or totally remove the laryngeal organs, including the vocal folds. As a result, they rely on a portable medical device called the electrolarynx to produce electrolaryngeal (EL) speech as an alternative speaking method [1]. However, EL speech faces two major problems: 1) the highly energetic excitation signals are always accompanied by radiated noise, which significantly degrades the quality of EL speech and causes distress to surrounding people; and 2) the speech sounds unnatural because the excitation signals provided to the vocal tract cannot simulate the variable F0 contours of the natural human voice. These problems can significantly impair patients’ quality of life and therefore urgently need to be addressed.

Voice conversion (VC) was originally designed to convert the speaker identity of speech from a source speaker into a target speaker while preserving the linguistic information [2], and it can also be applied to EL speech-to-normal speech conversion (EL2SP) as an enhancement method. Many EL2SP studies are based on conventional statistical models [3, 4], which follow a frame-to-frame mapping paradigm, i.e., the converted speech must have the same temporal structure as the source EL speech. This is detrimental to EL2SP performance, particularly for prosodic feature conversion.

Compared with conventional VC models, sequence-to-sequence (seq2seq) models [5] are more powerful and flexible in addressing the difficulties encountered in EL2SP. Seq2seq VC models employ an attention-based encoder-decoder framework to automatically determine the duration patterns of the synthesized speech. Moreover, seq2seq models can capture long-term dependencies such as prosody, suprasegmental characteristics of F0, and speaker identity [6]. Thus, seq2seq VC is promising for mitigating the information loss and temporal misalignments in EL speech. A few studies [7, 8] demonstrated the potential of seq2seq models in speaking-aid applications, and the latter also showed their advantages over non-seq2seq models. However, most seq2seq models require a huge parallel training corpus, and issues such as mispronunciations and skipped phonemes often occur when the training data are insufficient [9].

To address this issue, the seminal work in [10] used highly intelligible synthetic voices from a text-to-speech (TTS) model to implement many-to-one VC and applied it to hearing-impaired speech recovery. Subsequently, a seq2seq model named the Voice Transformer Network (VTN) [6] was extended with synthetic parallel data (SPD) to tackle non-parallel datasets [11]. Following that, [12] investigated the effectiveness of SPD in VC between normal speakers. In light of this, we propose a novel two-stage method that uses SPD to optimize the seq2seq model for the more challenging EL2SP task. In contrast to the strategy in [10], we use SPD of much lower quality, since it is impractical to build high-performance TTS models from the low-resource datasets available for EL2SP. We investigate the feasibility of the proposed EL2SP method and the effectiveness of SPD. The main contributions of this study are as follows.

  • Compared to the majority of systems, which have demanding data requirements, we propose an efficient and robust EL2SP system based on low-quality SPD.

  • We show that the proposed system outperforms a typical pretrained seq2seq VC model in terms of naturalness and intelligibility with a small, or even semi-parallel, original dataset.

  • The feasibility of SPD, which is easy to obtain but noisy, is verified for EL2SP.

2 Related works

2.1 Sequence-to-sequence voice conversion

A seq2seq VC model encodes an input source speech sequence $\mathbf{x}_{1:n}=\{\mathbf{x}_1,\mathbf{x}_2,\ldots,\mathbf{x}_n\}$ into a sequence of hidden representations $\mathbf{h}_{1:n}=\{\mathbf{h}_1,\mathbf{h}_2,\ldots,\mathbf{h}_n\}$. During autoregressive decoding, at each time step the attention mechanism considers the hidden representations of the encoder outputs and generates a context vector. The decoder then predicts the output acoustic features from the context vector and finally generates the converted speech $\mathbf{y}_{1:m}=\{\mathbf{y}_1,\mathbf{y}_2,\ldots,\mathbf{y}_m\}$ with the help of a post-network [13]. Based on this mechanism, the seq2seq VC model relaxes the alignment restriction and realizes a mapping between sequences of different lengths, i.e., $n \neq m$.
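
As a concrete illustration of this attention-based decoding step, the following minimal NumPy sketch (not the authors' implementation) computes the context vector for one decoder time step from a toy set of encoder hidden states; the shapes and values are arbitrary.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def attention_context(decoder_state, encoder_states):
        # Score every encoder hidden state h_1..h_n against the current decoder
        # state, normalize the scores, and take the weighted sum as the context.
        scores = encoder_states @ decoder_state      # shape (n,)
        weights = softmax(scores)                    # attention distribution over source frames
        return weights @ encoder_states              # context vector, shape (d,)

    rng = np.random.default_rng(0)
    h = rng.normal(size=(5, 8))    # toy encoder outputs h_1:n (n = 5 frames, d = 8 dims)
    s = rng.normal(size=8)         # decoder hidden state at the current output step
    c = attention_context(s, h)
    print(c.shape)                 # (8,): one context vector per generated output frame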

Most seq2seq VC models are built on either recurrent neural networks (RNNs) [14] or convolutional neural networks (CNNs) [15]. In general, RNNs have difficulty capturing effective long-term dependencies due to the nature of their recursive layers, while CNNs depend on the number of different kernels in the convolutional layers, particularly when handling long sequences. In contrast, the Transformer [16], which has shown great potential in speech processing, uses multi-head self-attention and positional embeddings, enabling it to attend to the entire input efficiently. In our work, we find that EL2SP can inherit the hidden representations learned for normal-speaker VC. Inspired by [8], the system we build shares the hidden representations of a Transformer-based TTS model with the VC model through pretraining.
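
The two Transformer ingredients mentioned above can be illustrated with a short, self-contained PyTorch snippet. This is only a toy demonstration of multi-head self-attention with sinusoidal positional embeddings, not the VTN itself; the 256-dimensional feature size is an arbitrary choice, while the four attention heads match the configuration reported in Section 4.1.

    import math
    import torch

    def sinusoidal_positions(length, d_model):
        # Standard sinusoidal positional embeddings from the Transformer paper [16].
        pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                        * (-math.log(10000.0) / d_model))
        pe = torch.zeros(length, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe

    torch.manual_seed(0)
    frames, d_model, heads = 120, 256, 4           # 4 heads as in Section 4.1; 256 dims is arbitrary
    x = torch.randn(frames, 1, d_model)            # (sequence, batch, feature) acoustic frames
    x = x + sinusoidal_positions(frames, d_model).unsqueeze(1)

    self_attn = torch.nn.MultiheadAttention(embed_dim=d_model, num_heads=heads)
    out, attn = self_attn(x, x, x)                 # every frame attends to the whole input sequence
    print(out.shape, attn.shape)                   # torch.Size([120, 1, 256]) torch.Size([1, 120, 120])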

2.2 Data augmentation in voice conversion

Data augmentation is an efficient approach to increase the variety and quantity of a dataset by generating synthetic training data, thus alleviating the problems of non-parallel, semi-parallel, or limited data in VC [11]. The authors of [17] presented a data augmentation method named ParaGen for frame-to-frame non-parallel VC, but it requires a sophisticated architecture with carefully tuned hyperparameters and costly training for speaker disentanglement to produce the augmented speech. In [10], a VC system named Parrotron was proposed to normalize an arbitrary speaker’s speech into a single canonical target voice by leveraging synthetic data; however, building the underlying TTS model relies on a huge amount of data. By and large, most studies focus on generating high-quality augmented data. In our study, a large original dataset is assumed to be unavailable, and the trained TTS models are directly used to produce relatively low-quality SPD for VC.

3 Proposed method

The central idea of the proposed method is to improve the performance of the seq2seq EL2SP system while requiring only a small initial dataset, possibly containing non-parallel parts, together with its transcriptions. As shown in Fig. 1, the whole system consists of three steps: (a) we first pretrain the Voice Transformer Network (VTN) on a large normal speech corpus (see Section 3.1); (b) we then fine-tune Transformer-based TTS models on the small original dataset to generate SPD for EL and normal speech, respectively (see Section 3.2); and (c) we perform the two-stage EL2SP training with the SPD and the original dataset (see Section 3.3). Further details are given in the following subsections.

3.1 Pretraining of sequence-to-sequence model

We adopt Transformer-based TTS pretraining for building a one-to-one VTN. The TTS model is similar to the VC model in that it decodes hidden representations into speech, while requiring only a single speaker's dataset and the corresponding transcriptions. This naturally motivates us to transfer knowledge from TTS to VC through pretraining. Fig. 1 (a) depicts the architecture, which contains two pretraining stages.

Given a large normal TTS corpus $D_{TTS}=\{T_{TTS}, S_{TTS}\}$, where $T_{TTS}$ and $S_{TTS}$ denote the text and speech data respectively, the decoder pretraining stage uses $D_{TTS}$ for typical TTS training. Since the encoder can produce hidden representations of the pure linguistic information in the text, the decoder acquires a robust capability of relating various speech features to linguistic information. In the encoder pretraining stage, an autoencoder is trained with a reconstruction loss in place of the original TTS text encoder, while the pretrained decoder is frozen. Here, $S_{TTS}$ is used as both the input and the target. In this manner, the encoder is forced to learn to extract analogous linguistic hidden representations from the speech instead of the text, provided that the decoder maintains its ability to recognize linguistic information from the encoded input.
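
The encoder pretraining stage can be sketched as follows. This is a toy PyTorch illustration of the general recipe (freeze the pretrained decoder, train a speech encoder with a reconstruction loss on S_TTS); the simple feed-forward modules, the L1 reconstruction loss, and all hyperparameters are stand-ins rather than the actual VTN components.

    import torch
    from torch import nn

    torch.manual_seed(0)
    feat_dim, hidden_dim = 80, 256               # 80-dim mel features as in Section 4.1

    # Toy stand-ins for the speech encoder and the TTS-pretrained decoder.
    encoder = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
                            nn.Linear(hidden_dim, hidden_dim))
    decoder = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                            nn.Linear(hidden_dim, feat_dim))

    for p in decoder.parameters():               # freeze the pretrained decoder
        p.requires_grad = False

    optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)
    criterion = nn.L1Loss()                      # reconstruction loss on the speech features

    s_tts = torch.randn(32, feat_dim)            # a toy batch of frames standing in for S_TTS
    for step in range(10):
        optimizer.zero_grad()
        recon = decoder(encoder(s_tts))          # S_TTS is used as both input and target
        loss = criterion(recon, s_tts)
        loss.backward()
        optimizer.step()
    print(f"reconstruction loss after 10 steps: {loss.item():.3f}")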

The above is the complete VTN pretraining process. Note that we choose a normal speech corpus for pretraining because it can be easily obtained, and because the goal of both EL2SP and normal VC is to convert hidden linguistic representations into normal speech, which enables EL2SP to inherit this capability. After that, the fine-tuning stage can be carried out using an EL2SP corpus different from $D_{TTS}$.

3.2 Synthetic parallel data generation

As illustrated in Fig. 1 (b), this process aims to produce EL and normal SPD from TTS systems. Therefore, we train TTS models not only for normal speech generation but also for EL speech generation. However, the initial source EL corpus, $D_{EL}=\{T_{EL}, S_{EL}\}$, and the target normal corpus, $D_{N}=\{T_{N}, S_{N}\}$, are too small to train these models from scratch. We therefore fine-tune the TTS model obtained in the decoder pretraining stage on the two corpora, obtaining the corresponding models $TTS_{EL}$ and $TTS_{N}$. Note that the performance of $TTS_{EL}$ and $TTS_{N}$ is still poor because the original datasets are very small. Finally, we feed an external set of text to the two systems to generate low-quality EL and normal SPD, denoted by $S_{EL}^{SPD}$ and $S_{N}^{SPD}$, respectively.

3.3 Two-stage EL2SP training

As presented in Fig. 1 (c), the parameters of $VTN_{pt}$ from Section 3.1 can be used as valid a priori information to initialize the VTN-based EL2SP model, achieving better performance and much faster convergence than training from scratch. The SPD and the original natural speech pairs are then pooled together as the training set to obtain the EL2SP model $VTN_{1st}$; this is denoted as Stage I. Although SPD can model the intonation and speaker identity of normal and EL speech, it contains erroneous information due to the limited performance of the TTS models, which may prevent the EL2SP model from building perfect hidden representations. We hypothesize that the limited natural dataset can provide a correction during training. Therefore, Stage II is set up to obtain the final EL2SP model $VTN_{2nd}$: we initialize it with the parameters of $VTN_{1st}$ and continue training with the original natural data pairs only.
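
The two-stage procedure can be summarized in code as below. This is a hedged sketch in which a tiny feed-forward network stands in for the VTN, the checkpoint path is hypothetical, and random tensors replace the real (EL, normal) feature pairs; only the data-pooling and re-initialization logic reflects the proposed method.

    import torch
    from torch import nn
    from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

    def train(model, dataset, epochs=5, lr=1e-4):
        loader = DataLoader(dataset, batch_size=8, shuffle=True)
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(epochs):
            for src, tgt in loader:
                optimizer.zero_grad()
                loss = nn.functional.l1_loss(model(src), tgt)
                loss.backward()
                optimizer.step()
        return model

    torch.manual_seed(0)
    feat_dim = 80
    model = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim))
    # model.load_state_dict(torch.load("vtn_pt.pth"))  # in practice: initialize from VTN_pt

    # Random tensors standing in for aligned (EL, normal) feature pairs.
    original = TensorDataset(torch.randn(140, feat_dim), torch.randn(140, feat_dim))
    spd      = TensorDataset(torch.randn(4000, feat_dim), torch.randn(4000, feat_dim))

    vtn_1st = train(model, ConcatDataset([original, spd]))  # Stage I: SPD pooled with original pairs
    vtn_2nd = train(vtn_1st, original)                      # Stage II: original natural pairs only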

4 Experimental evaluation

4.1 Datasets and implementation

We built two original small-scale Japanese datasets to evaluate the proposed method. In the first, a healthy male speaker recorded parallel utterances using an electrolarynx and his normal voice, denoted by SimuEL and NormSP, respectively. Each set contains 413 sentences, totaling about 20 minutes. The other EL set was recorded by a real male laryngectomee and is denoted as RealEL. Since we could not record a normal corpus from this laryngectomee, NormSP is shared as the reference target speech. Note that RealEL contains only 200 sentences within 10 minutes, whose contents cover only part of NormSP, so this dataset constitutes a semi-parallel corpus.

Two experiments were designed based on these two datasets. In case 1, the goal was to convert SimuEL to NormSP (SimuEL-NormSP), which simulates the situation where a normal speech corpus of the patient is available. In case 2, we aimed to convert RealEL to NormSP (RealEL-NormSP), a more severe but practical scenario. In each case, we used 20 utterances from the respective dataset as a development set, another 40 as an evaluation set, and the rest as a training set. To handle the non-parallel part in RealEL-NormSP, we supplemented the missing EL utterances with the corresponding synthetic EL speech so that all feasible data could be used by our method. Note that since the two cases provide different datasets, we built separate evaluation sets for simplicity.

We used ESPnet [18] to implement the proposed system. All speech signals were resampled to 24 kHz, and 80-dimensional mel filterbanks with 2048 FFT points and a 300-point frame shift were used to extract the acoustic features. Following Section 3.1, we built the pretrained VTN using the Japanese JSUT corpus [19], which contains 7696 utterances recorded by a female speaker, roughly 10 hours in total. The text data of the JSUT corpus was also used to generate the external SPD. The VTN used for EL2SP and the TTS models used to generate SPD share similar base configurations, using six-layer blocks with four attention heads in both the encoder and the decoder. The input feature sequence of the TTS model contains phoneme and pause information. For waveform synthesis, we used Parallel WaveGAN (PWG) neural vocoders [20] to reconstruct the SPD and the output of EL2SP. Speaker-dependent PWG vocoders were trained separately on SimuEL, NormSP, and RealEL. Note that the PWG vocoder for NormSP was trained from scratch, while those for SimuEL and RealEL were pretrained on an EL dataset of 1000 utterances. Moreover, we set up typical Transformer-based baseline models for case 1 and case 2. Their framework and pretraining process are identical to those of the proposed system, but they use only the original parallel pairs without SPD during EL2SP training, i.e., 353 pairs for case 1 and 140 pairs for case 2.
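
For reference, the stated analysis settings (24 kHz sampling, 80 mel filterbanks, 2048-point FFT, 300-sample frame shift) can be reproduced with librosa as below; the actual ESPnet feature pipeline may differ in details such as windowing and normalization, and the test tone merely stands in for real speech.

    import numpy as np
    import librosa

    # Analysis settings from Section 4.1: 24 kHz sampling, 80 mel filterbanks,
    # 2048-point FFT, 300-sample frame shift.
    sr, n_fft, hop, n_mels = 24000, 2048, 300, 80

    t = np.arange(0, 1.0, 1.0 / sr)
    wav = 0.1 * np.sin(2 * np.pi * 220.0 * t).astype(np.float32)   # 1-s test tone in place of speech

    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=n_mels)
    log_mel = np.log(np.maximum(mel, 1e-10))                       # log-mel filterbank features
    print(log_mel.shape)                                           # (80, number_of_frames)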

Table 1: Comparison results of the proposed method and the baseline system for case 1: SimuEL-NormSP. External SPD quality is measured by CER (%); EL2SP results are reported as MCD (dB) and CER (%). SPD-N denotes the proposed method trained with N external SPD pairs.

System    | EL SPD CER | Normal SPD CER | Stage I MCD | Stage I CER | Stage II MCD | Stage II CER
SPD-1000  | 35.3       | 16.4           | 5.95        | 42.1        | 5.93         | 39.4
SPD-2000  | 39.4       | 20.0           | 5.90        | 40.7        | 5.82         | 37.8
SPD-4000  | 44.5       | 25.6           | 5.87        | 37.4        | 5.78         | 36.9
SPD-7296  | 53.0       | 34.8           | 5.78        | 36.3        | 5.77         | 34.7
Baseline  | -          | -              | MCD: 6.37, CER: 51.4 (single-stage training without SPD)
Table 2: Comparison results of the proposed method and the baseline system for case 2: RealEL-NormSP. External SPD quality is measured by CER (%); EL2SP results are reported as MCD (dB) and CER (%). SPD-N denotes the proposed method trained with N external SPD pairs.

System    | EL SPD CER | Normal SPD CER | Stage I MCD | Stage I CER | Stage II MCD | Stage II CER
SPD-1000  | 49.5       | 17.2           | 6.66        | 26.9        | 6.59         | 24.5
SPD-2000  | 54.8       | 21.0           | 6.48        | 27.7        | 6.41         | 25.3
SPD-4000  | 61.5       | 25.7           | 6.28        | 26.3        | 6.18         | 21.9
SPD-7296  | 71.7       | 33.7           | 6.26        | 29.1        | 6.24         | 23.3
Baseline  | -          | -              | MCD: 7.17, CER: 41.3 (single-stage training without SPD)

4.2 Evaluation metrics

We performed objective evaluations using two metrics: 1) mel cepstrum distortion (MCD), calculated between the ground-truth and generated samples to measure the spectral envelope distortion, and 2) the character error rate (CER), to evaluate intelligibility. Since the quality of SPD also needed to be examined in this study, CER was additionally used as a quality criterion for SPD. The ASR engine used to calculate CER was a Conformer-based model [21] pretrained on the Japanese LaboroTV database [22] and fine-tuned as in [23]. It recognized the original EL and normal speech with best CERs (in %) of 14.7 and 6.9, respectively, which can be regarded as upper bounds of the ASR performance when assessing the quality of SPD.
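
For clarity, the two objective metrics can be computed as in the following sketch; the MCD formula is the standard frame-averaged definition over already time-aligned mel-cepstral sequences, and the toy edit-distance CER here only illustrates the metric, whereas the paper obtains hypotheses from the pretrained Conformer ASR engine.

    import numpy as np

    def mel_cepstral_distortion(c_ref, c_conv):
        # Frame-averaged MCD (dB) between two time-aligned mel-cepstral sequences
        # of shape (frames, dims); the 0th (energy) coefficient is excluded.
        diff = c_ref[:, 1:] - c_conv[:, 1:]
        return float(np.mean((10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))))

    def character_error_rate(ref, hyp):
        # CER = Levenshtein distance between character sequences / reference length.
        d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
        d[:, 0] = np.arange(len(ref) + 1)
        d[0, :] = np.arange(len(hyp) + 1)
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1, d[i - 1, j - 1] + cost)
        return d[len(ref), len(hyp)] / len(ref)

    rng = np.random.default_rng(0)
    print(mel_cepstral_distortion(rng.normal(size=(100, 25)), rng.normal(size=(100, 25))))
    print(character_error_rate("こんにちは", "こんにちわ"))    # one substitution over five characters: 0.2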

We also performed subjective tests to evaluate the perceptual naturalness of the converted speech. An opinion test was conducted in terms of the mean opinion score (MOS), in which listeners were asked to rate the naturalness of each speech sample on a scale of one to five. For each of case 1 and case 2, twenty random utterances were chosen for the naturalness tests. More than thirty listeners who speak Japanese were recruited. Audio samples are available online (https://forpapersub.github.io/For_SLT_2022/).
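
The MOS statistics reported later (means with 95% confidence intervals, cf. Fig. 3) can be computed as in the following sketch, assuming a t-distribution interval over per-system ratings; the ratings below are invented toy values.

    import numpy as np
    from scipy import stats

    def mos_with_ci(scores, confidence=0.95):
        # Mean opinion score with a t-distribution confidence interval half-width.
        scores = np.asarray(scores, dtype=float)
        half_width = stats.sem(scores) * stats.t.ppf((1 + confidence) / 2, len(scores) - 1)
        return scores.mean(), half_width

    ratings = [4, 3, 5, 4, 4, 3, 4, 5, 3, 4]    # toy 1-to-5 naturalness ratings for one system
    mean, hw = mos_with_ci(ratings)
    print(f"MOS = {mean:.2f} +/- {hw:.2f} (95% CI)")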

4.3 Experimental results

4.3.1 Objective evaluation

To investigate the impact of SPD on EL2SP performance, we compared different amounts of SPD. Here, the term datasize is used to denote the amount of data. Note that the TTS models used to generate SPD were not shared between case 1 and case 2 because their training sets differ. We built a filtering system based on the ASR model to select SPD training pairs of better quality according to CER, so as to reduce errors during training. The filtered datasize was gradually increased from 1000 until all available external SPD (7296) was used, and the two-stage training procedure described in Section 3.3 was performed for each setting. Fig. 2 shows the specific training datasets of the baseline and the proposed method for both cases.
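
One plausible reading of this CER-based screening is to rank the synthetic pairs by the ASR CER of their synthesized utterances and keep the N best, as in the sketch below; the identifiers, file names, and CER values are invented, and the exact selection rule used in the paper may differ.

    # Each candidate synthetic pair carries the CER obtained by running the
    # pretrained ASR engine on its synthesized utterance (toy values below).
    candidates = [
        {"text_id": f"text_{i:04d}", "el_wav": f"el_{i:04d}.wav",
         "norm_wav": f"norm_{i:04d}.wav", "asr_cer": cer}
        for i, cer in enumerate([0.12, 0.45, 0.08, 0.30, 0.22, 0.55, 0.18, 0.40])
    ]

    def screen_spd(candidates, datasize):
        # Keep the `datasize` synthetic pairs with the lowest ASR CER.
        ranked = sorted(candidates, key=lambda c: c["asr_cer"])
        return ranked[:datasize]

    for size in (2, 4, 8):                      # the paper uses 1000, 2000, 4000, and 7296
        subset = screen_spd(candidates, size)
        print(size, [c["text_id"] for c in subset])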

The objective evaluation results of the two-stage method for case 1 and case 2 with different SPD datasizes are listed in Table 1 and Table 2, respectively, together with the CER values of the SPD, which indicate its quality.

Fig. 2: The training datasets for the baseline system and for Stage I and Stage II of the proposed method, for (a) case 1: SimuEL-NormSP and (b) case 2: RealEL-NormSP. 1000, 2000, 4000, and 7296 denote the screened SPD datasizes used in the experiments.

For case 1, the CER values characterizing the quality of the EL and normal SPD increase steadily from 35.3 and 16.4 to 53.0 and 34.8, respectively. Nevertheless, both MCD and CER consistently improve as the SPD datasize increases from 1000 to 7296 in Stage I. In addition, all results using SPD are significantly better than the baseline system in terms of quality and intelligibility. In Stage II, we applied the same training procedure to the models obtained from Stage I with different SPD datasizes, which yielded further improvements. We conclude that in case 1, the proposed two-stage training using SPD and the original dataset helps the system build rich linguistic hidden representations starting from Stage I.

In case 2, although the normal SPD keeps the same quality level as in case 1, the quality of the EL SPD in Table 2 is much worse than that in Table 1. This is because the original training corpus for RealEL (140 pairs) is less than half the size of that for SimuEL (353 pairs), which leads to a worse EL TTS model; the CER of the EL SPD rises sharply from 49.5 to 71.7. Surprisingly, the Stage I systems still outperform the baseline system by a notable margin in terms of MCD and CER even with such SPD. The MCD continuously decreases as the SPD datasize increases from 1000 to 7296, whereas the improvement in CER is less pronounced and remains at a similar level. When all SPD (7296) is pooled into training, the MCD value is the best, while the CER increases compared with the 4000 setting. We surmise that such large-scale but less intelligible SPD improves the speech quality while degrading the accuracy of the utterance contents. The results of Stage II are better than those of Stage I for every datasize, especially when more SPD of lower quality is involved in Stage I; e.g., the CER decreases by 5.1 on average for SPD datasizes of 4000 and 7296.

Fig. 3: MOS results with 95% confidence intervals for naturalness, for (a) case 1: SimuEL-NormSP and (b) case 2: RealEL-NormSP. S-EL and T-NL indicate source EL and target normal speech, respectively.

The overall results demonstrate the robustness of the proposed method: the low-quality SPD allows the system to outperform the baseline in Stage I, and Stage II maximizes the advantages of large-scale SPD, leaving a negligible negative impact on CER.

4.3.2 Subjective evaluation

The outputs obtained with an SPD datasize of 4000 were selected for the subjective evaluation tests in both cases. Fig. 3 illustrates the results, in which five types of speech were mixed into the test set. The seq2seq-based baseline systems obtain reasonable naturalness scores (approximately 3) even though their intelligibility is not ideal, which suggests that the listeners paid more attention to factors such as fluency and voice quality than to the linguistic content. Moreover, the proposed method yields significantly better performance than the baseline system already from Stage I. Although case 2 involves a more difficult mapping between different speakers, its results are only slightly worse than those of case 1, showing the effectiveness of the proposed method. In addition, the results of both cases show a consistent and stable improvement from Stage I to Stage II, in line with the findings in Section 4.3.1.

5 Conclusion

We have demonstrated that the proposed two-stage method, which incorporates large-scale low-quality SPD, is a simple and effective way to optimize EL2SP compared with the baseline system. In particular, the effectiveness and robustness of low-quality SPD have been verified, and the system performance is positively correlated with the amount of SPD. Moreover, the second-stage training can further diminish the negative impact of SPD and optimize the results. Interestingly, the texts used to generate the external SPD come from the VTN pretraining database, which confirms the high cost-effectiveness of our method. This progress opens the path for future studies, including: 1) randomly selecting SPD to clarify the influence of SPD datasize and quality on the results; and 2) re-establishing a semi-parallel dataset with less normal speech to investigate its effect on the proposed method.

6 Acknowledgement

This work was partly supported by JST CREST JPMJCR19A3, and AMED JP21dk0310114, Japan.

References

  • [1] M. I. Singer and E. D. Blom, “An endoscopic technique for restoration of voice after laryngectomy,” Annals of Otology, Rhinology and Laryngology, no. 6, pp. 529-533, 1980.
  • [2] D. Childers, B. Yegnanarayana, and K. Wu, “Voice conversion: Factors responsible for quality,” 1985 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 10, pp. 748-751, 1985.
  • [3] K. Nakamura, T. Toda, H. Saruwatari, and K. Shikano, “Speaking-aid systems using GMM-based voice conversion for electrolaryngeal speech,” Speech Communication, vol. 54, no. 1, pp. 134–146, 2012.
  • [4] K. Tanaka, T. Toda, G. Neubig, S. Sakti, and S. Nakamura, “A hybrid approach to electrolaryngeal speech enhancement based on noise reduction and statistical excitation generation,” IEICE Transactions on Information and Systems, vol. E97-D, no. 6, pp. 1429–1437, 2014.
  • [5] I. Sutskever, O. Vinyals, and Q. V Le, “Sequence to Sequence Learning with Neural Networks,” Neural Information Processing Systems, pp. 3104–3112, 2014.
  • [6] W. C. Huang, T. Hayashi, Y. C. Wu, H. Kameoka, and T. Toda, “Voice transformer network: Sequence-to-sequence voice conversion using transformer with text-to-speech pretraining,” Proc. Interspeech, pp. 4676-4680, 2020.
  • [7] W. C. Huang, K. Kobayashi, Y. H. Peng, C. F. Liu, Y. Tsao, H. M. Wang, and T. Toda, “A Preliminary Study of a Two-Stage Paradigm for Preserving Speaker Identity in Dysarthric Voice Conversion,” Proc. Interspeech, pp. 1329–1333, 2021.
  • [8] M. C. Yen, W. C. Huang, K. Kobayashi, Y. H. Peng, S. W. Tsai, Y. Tsao, T. Toda, J. S. Roger Jang, and H. M. Wang, “Mandarin Electrolaryngeal Speech Voice Conversion with Sequence-to-Sequence Modeling,” 2021 IEEE Automatic Speech Recognition and Understanding Workshop, pp. 650-657, 2021.
  • [9] J. X. Zhang, Z. H. Ling, Y. Jiang, L. J. Liu, C. Liang, and L. R. Dai, “Improving Sequence-to-sequence Voice Conversion by Adding Text-supervision,” IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6785–6789, 2019.
  • [10] F. Biadsy, R. J. Weiss, P. J. Moreno, D. Kanvesky, and Y. Jia, “Parrotron: An End-to-End Speech-to-Speech Conversion Model and its Applications to Hearing-Impaired Speech and Speech Separation,” Proc. Interspeech, pp. 4115–4119, 2019.
  • [11] W. C. Huang, P. L. Tobing, Y. C. Wu, K. Kobayashi, and T. Toda, “The NU voice conversion system for the Voice Conversion Challenge 2020: On the effectiveness of sequence-to-sequence models and autoregressive neural vocoders,” ISCA Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, 2020.
  • [12] D. Ma, W. C. Huang, and T. Toda, “Investigation of Text-to-Speech-based Synthetic Parallel Data for Sequence-to-Sequence Non-Parallel Voice Conversion,” IEEE Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, pp. 870-877, 2021.
  • [13] B. Sisman, J. Yamagishi, S. King, and H. Li, “An overview of voice conversion and its challenges: From statistical modeling to deep learning,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 132-157, 2020.
  • [14] J. X. Zhang, Z. H. Ling, L. J. Liu, Y. Jiang, and L. R. Dai, “Sequence-to-sequence acoustic modeling for voice conversion,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 3, pp. 631–644, 2019.
  • [15] H. Kameoka, K. Tanaka, T. Kaneko, and N. Hojo, “ConvS2S-VC: Fully convolutional sequence-to-sequence voice conversion,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1849-1863, 2020.
  • [16] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is All you Need,” Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
  • [17] B. Chen, Z. Xu, and K. Yu, “Data augmentation based non-parallel voice conversion with frame-level speaker disentangler,” Speech Communication, vol. 136, pp. 14-22, 2020.
  • [18] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. E. Y. Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai, “ESPnet: End-to-end speech processing toolkit,” Proc. Interspeech, pp. 2207-2211, 2018.
  • [19] R. Sonobe, S. Takamichi, and H. Saruwatari, “JSUT corpus: free large-scale Japanese speech corpus for end-to-end speech synthesis,” arXiv preprint arXiv:1711.00354, 2017.
  • [20] R. Yamamoto, E. Song, and J. M. Kim, “Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6199-6203, 2020.
  • [21] A. Gulati, J. Qin, C. C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang, “Conformer: Convolution-augmented Transformer for Speech Recognition,” Proc. Interspeech, pp. 5036–5040, 2020.
  • [22] S. Ando and H. Fujihara, “Construction of a large-scale Japanese ASR corpus on TV recordings,” 2021 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6948–6952, 2021.
  • [23] L. P. Violeta, W. C. Huang, and T. Toda, “Investigating Self-supervised Pretraining Frameworks for Pathological Speech Recognition,” Proc. Interspeech, pp. 41-45, 2022.