An Investigation on Applying Acoustic Feature Conversion
to ASR of Adult and Child Speech

Abstract

The performance of child speech recognition is generally less satisfactory than that of adult speech, due to the limited amount of training data. Significant performance degradation is expected when an automatic speech recognition (ASR) system trained on adult speech is applied directly to child speech, as a result of domain mismatch. The present study focuses on adult-to-child acoustic feature conversion to alleviate this mismatch. Different acoustic feature conversion approaches, including deep neural network based and signal processing based ones, are investigated and compared under a fair experimental setting, in which converted acoustic features from the same amount of labeled adult speech are used to train the ASR models from scratch. Experimental results reveal that not all of the conversion methods lead to an ASR performance gain. Specifically, statistic matching, a classic unsupervised domain adaptation method, shows no effectiveness. A disentanglement-based auto-encoder (DAE) conversion framework is found to be useful, and F0 normalization achieves the best performance. It is noted that the F0 distribution of the converted features is an important indicator of conversion quality, whereas judging the quality with an adult-child deep classification model is shown to be unreliable.

Index Terms: child speech recognition, acoustic feature conversion, unsupervised domain adaptation

1 Introduction

Automatic speech recognition (ASR) is the technology of converting speech signals into text or equivalent linguistic representations. Deep neural network (DNN) based acoustic and language models have greatly accelerated the advancement of ASR [1, 2, 3, 4, 5, 6], making speech-enabled human-computer interaction feasible and widely accessible. However, the performance of ASR systems on child speech is generally less satisfactory, falling significantly behind state-of-the-art systems for adult speech [7, 8]. This is largely due to the difficulty of collecting sufficient and diverse child speech data. For ASR research, it is relatively easy to acquire databases of hundreds of hours of adult speech, while child speech databases are much smaller, e.g., tens of hours, or even non-existent for many less-resourced languages. ASR models trained on a large amount of adult speech can be used directly to decode child speech utterances, but a drastic degradation of recognition performance is expected, because of the mismatch between training and test data in many aspects [9, 10].

Child speech exhibits great inter-speaker acoustic variability, which poses a number of modeling challenges [11, 12, 13]. Having shorter vocal tracts than adults [14], children produce speech with higher fundamental frequency (F0) and overall higher formant frequencies. Their developing articulators result in a slower and less stable speaking rate [15]. Children also tend to make more pronunciation and grammatical errors during the process of language acquisition [9]. These characteristics contribute to both acoustic and linguistic mismatch between adult ASR models and child speech. The present study focuses mainly on the acoustic mismatch.

Our goal is to develop an ASR system that performs well on child speech. We consider a zero-resourced scenario for child speech, i.e., only labeled adult speech and unlabeled child speech data are available. Building an acoustic model by domain adversarial training (DAT) appears to be a straightforward approach in this scenario [16, 17]. Here we take another perspective, namely reducing the mismatch by transforming the input features, e.g., log Mel spectrograms. We aim to convert adult speech features so that they resemble child speech features and carry the acoustic characteristics of child speech. An ASR model trained on the converted features is expected to better match child speech acoustically and hence achieve better recognition performance.

There are many choices of conversion methods. In particular, DNN-based conversion models have become prevalent in recent years. In this study, the disentanglement-based auto-encoder (DAE) is adopted as the basic framework for investigation [18, 19, 20, 21]. This framework considers two coupled factors of variation in speech, namely the linguistic and para-linguistic factors. The linguistic factor refers to the speech content, and the para-linguistic factor covers all content-irrelevant information, including speaker identity, emotion, prosody, and speaking style. We assume that the information distinguishing adult from child speech is encoded as part of the para-linguistic factor and refer to it as a general speaker factor. Ideally, acoustic feature conversion between adult and child speech can be achieved by modifying this speaker factor.

Apart from the DAE, the cycle-consistent generative adversarial network (CycleGAN) is a popular approach to domain transfer. Originally proposed for image-to-image translation in computer vision [22], CycleGAN is trained with cycle-consistency and adversarial losses and does not require paired data. It has been applied to many unsupervised domain adaptation tasks, including voice conversion [23, 24, 25]. In addition to these DNN-based conversion models, traditional signal processing methods, e.g., formant modification [26] and time-scale modification [10], can also be applied to acoustic feature conversion.

In this paper, the disentanglement-based AE framework is investigated and compared with other acoustic feature conversion approaches under a fair setting, i.e., the ASR models are trained with converted features from the same amount of labeled adult speech. The efficacies of the component modules in the DAE framework are examined through an ablation study.

The DAE-based conversion framework is described in Section 2. Section 3 presents the experimental setup for acoustic feature conversion and ASR model training. Section 4 gives the results, and the work is concluded in Section 5.

Figure 1: The overall workflow of the adult-to-child acoustic feature conversion for ASR.

2 Methods

2.1 Workflow of experimental process

The overall workflow of the acoustic feature conversion process includes two steps, as illustrated in Figure 1. The first step is to perform adult-to-child feature transformation via a conversion model. The conversion approaches fall into two streams: signal processing based methods, e.g., formant modification (FM) and time-scale modification (TSM), and DNN-based methods, e.g., the disentanglement-based AE and CycleGAN. After feature conversion, the second step is to train an ASR model using the converted feature set. The improvement in recognition performance on child test speech is evaluated against systems trained without converted features.

2.2 AdaIN voice conversion

Voice conversion (VC) aims to modify content-irrelevant information while preserving the linguistic content. The same idea is adopted for acoustic feature conversion, except that the vocoder for converting spectrograms to waveforms is not needed. The AdaIN network in [18] is a disentanglement-based auto-encoder that performs one-shot VC by separating content and speaker embeddings with instance normalization (IN). Three modules constitute the AdaIN network: the content encoder $E_c$, the speaker encoder $E_s$ and the decoder $D$. The acoustic feature $\mathbf{x}$ is fed into $E_c$ and $E_s$ as input. $E_c$ generates a sequence of content embeddings $\mathbf{zc}$, while $E_s$ produces the speaker representation $\mathbf{zs}$. The decoder outputs $\hat{\mathbf{x}}$, which is intended to reconstruct $\mathbf{x}$ from $\mathbf{zc}$ and $\mathbf{zs}$. IN is applied in the content encoder to remove speaker-related information, while adaptive IN [27] is used to provide the global speaker information encoded by $E_s$ to the decoder. The whole network is trained to minimize the reconstruction loss $||D(\mathbf{zc},\mathbf{zs})-\mathbf{x}||_{1}$ in an unsupervised manner. Conversion is performed by feeding $\mathbf{zc}_{src}$ of the source speaker and $\mathbf{zs}_{tar}$ of the target speaker to the decoder, i.e., $D(\mathbf{zc}_{src},\mathbf{zs}_{tar})$.
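
As a concrete reference for the modules described above, the following is a minimal PyTorch sketch of such a disentanglement-based auto-encoder. The layer sizes, kernel sizes and pooling choices are illustrative assumptions and far simpler than the actual architecture of [18]; only the roles of $E_c$, $E_s$, $D$, IN and adaptive IN are reflected.

import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    def __init__(self, n_mels=80, channels=512):
        super().__init__()
        self.conv = nn.Conv1d(n_mels, channels, kernel_size=5, padding=2)
        # Instance normalization strips per-utterance (speaker) statistics.
        self.norm = nn.InstanceNorm1d(channels)

    def forward(self, x):                      # x: (B, n_mels, T)
        return self.norm(torch.relu(self.conv(x)))   # zc: (B, C, T)

class SpeakerEncoder(nn.Module):
    def __init__(self, n_mels=80, channels=512):
        super().__init__()
        self.conv = nn.Conv1d(n_mels, channels, kernel_size=5, padding=2)

    def forward(self, x):                      # global average pooling over time
        return torch.relu(self.conv(x)).mean(dim=-1)  # zs: (B, C)

class Decoder(nn.Module):
    def __init__(self, n_mels=80, channels=512):
        super().__init__()
        self.affine = nn.Linear(channels, 2 * channels)   # AdaIN scale and bias
        self.out = nn.Conv1d(channels, n_mels, kernel_size=5, padding=2)

    def forward(self, zc, zs):
        gamma, beta = self.affine(zs).chunk(2, dim=-1)
        # Adaptive IN: re-inject the global speaker statistics into the content stream.
        h = gamma.unsqueeze(-1) * zc + beta.unsqueeze(-1)
        return self.out(h)                     # x_hat: (B, n_mels, T)

def reconstruction_loss(x, Ec, Es, D):
    zc, zs = Ec(x), Es(x)
    return (D(zc, zs) - x).abs().mean()        # L1 reconstruction objective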

Figure 2: The data flow of the DAE conversion framework to conduct adult-to-child conversion.

2.3 DAE acoustic feature conversion

As shown in Figure 2, the AdaIN network (in the dashed block) is the core module of the DAE framework, which aims to perform adult-to-child conversion in the acoustic feature space. The subscript $A/C$ in a variable indicates that it represents the adult or the child domain. The solid-line arrows illustrate the reconstruction process in the training phase, while the dashed arrows refer to the conversion stage. $\overline{\mathbf{zs}}_{C}$ is the representation of the child domain, calculated by averaging the speaker embeddings of all child speech utterances. $\mathbf{x}_{A2C}$ denotes the converted acoustic feature, which is generated by replacing $\mathbf{zs}_{A}$ with $\overline{\mathbf{zs}}_{C}$. When evaluating ASR models trained on $\mathbf{x}_{A2C}$, to overcome the large mismatch between real and generated speech features, the child test speech is also passed through $C2C$ conversion, and the corresponding generated feature is denoted as $\mathbf{x}_{C2C}$. The whole conversion process can be expressed as:

$\mathbf{x}_{A2C/C2C} = D(\mathbf{zc}_{A/C}, \overline{\mathbf{zs}}_{C})$ (1)
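
A hedged sketch of this conversion step, reusing the module names from the sketch in Section 2.2; the batching of the child utterances is an illustrative assumption.

import torch

@torch.no_grad()
def convert_to_child(x_src, child_utts, Ec, Es, D):
    # Average the speaker embeddings over child utterances -> zs_bar_C.
    zs_bar_C = torch.stack([Es(u) for u in child_utts]).mean(dim=0)
    zc_src = Ec(x_src)              # content embeddings of the source utterance
    return D(zc_src, zs_bar_C)      # x_A2C for adult input, x_C2C for child input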

The original AdaIN network is trained in a fully unsupervised manner, i.e., only speech data are required. In this work, however, the domain class label and the utterance-level median F0 value are additionally used to facilitate better conversion. A domain-critic module is built on top of $\mathbf{zc}$ and adversarially trained to force the content embeddings to be domain-invariant. Since the F0 distribution is found to be an important attribute for discriminating adult and child speech, the median F0 of each utterance is estimated and used to train an F0 classifier. Moreover, matrix subspace projection (MSP) [28] is applied to the speaker embedding space. It transforms $\mathbf{zs}$ into a low-dimensional attribute space, adult versus child domain in this case, which is expected to make $\mathbf{zs}$ contain more discriminative domain information.
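
The paper states only that the domain critic is adversarially trained; one standard way to implement this, following the DAT recipe of [16], is a gradient-reversal layer between $\mathbf{zc}$ and the critic, as sketched below. The critic depth matches the three FC layers described later in Section 3.2, but the hidden width and the fixed reversal weight are assumptions.

import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.clone()

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None   # flip the gradient sign toward the encoder

class DomainCritic(nn.Module):
    def __init__(self, channels=512, n_domains=3, lam=1.0):   # domains: A, C1, C2
        super().__init__()
        self.lam = lam
        self.net = nn.Sequential(
            nn.Linear(channels, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, n_domains),
        )

    def forward(self, zc):                 # zc: (B, C, T)
        pooled = zc.mean(dim=-1)           # utterance-level summary of the content
        return self.net(GradReverse.apply(pooled, self.lam))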

3 Experimental Setup

3.1 Dataset

Acoustic feature conversion is performed to reduce the mismatch between the two speech domains, namely adult speech and child speech. The adult speech ($A$) data are a subset of the AISHELL1 corpus. The child speech ($C$) data are from the 2021 SLT CSRC (Children Speech Recognition Challenge) dataset. AISHELL1 [29] is an open-source dataset of Mandarin read speech by adult speakers, intended and widely used for ASR research. The CSRC dataset [30] consists of two parts of child speech with different speaking styles. The first part, denoted by $C_1$, contains read speech. The second part, denoted by $C_2$, contains conversational speech. The test sets of $C_1$ and $C_2$ each contain 2 hours of speech. The training sets of AISHELL1 and CSRC are summarized in Table 1.

Table 1: A summary of the training datasets used in this research.

Data set         A        C1       C2
Duration (hrs)   60       24       25
# of Utts        48,515   23,824   25,447
# of Speakers    137      742      133
Speaker age      18-60    7-11     4-11
Speaking style   read     read     conversational

3.2 Feature conversion

80-dimensional log Mel spectrograms are extracted from the raw audio with a 25 ms window length and a 10 ms hop length. All audio data are sampled at 16 kHz. The feature sets of $A$, $C_1$ and $C_2$ are pooled together and global mean and variance normalization (GMVN) is applied. The disentanglement-based AE network is trained with the normalized features. It is noted that the speech in $C_2$ exhibits a significant mismatch with $C_1$ in preliminary ASR experiments, which is related to the difference in speaking styles, i.e., read vs. conversational speech. Therefore, we consider adult-to-child conversion under the same speaking style. Specifically, $C_1$ is regarded as the target child domain of interest, meaning that there are three conversion types in total, namely $A2C_1$, $C_12C_1$, and $C_22C_1$.
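
A minimal sketch of this feature extraction pipeline using torchaudio (one possible implementation choice, not necessarily the authors'): 80 Mel channels, a 25 ms window and 10 ms hop at 16 kHz, with GMVN statistics computed over the pooled feature sets.

import torch
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, win_length=400, hop_length=160, n_mels=80)

def log_mel(waveform):                          # waveform: (1, num_samples) at 16 kHz
    return torch.log(mel(waveform) + 1e-6)      # (1, 80, T)

def gmvn_stats(feature_list):
    # Pool all utterances (A, C1 and C2) along time before computing the statistics.
    pooled = torch.cat([f.squeeze(0) for f in feature_list], dim=-1)   # (80, sum_T)
    return pooled.mean(dim=-1, keepdim=True), pooled.std(dim=-1, keepdim=True)

def apply_gmvn(feat, mean, std):
    return (feat - mean) / (std + 1e-6)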

The domain-critic module is implemented as a three-layer fully-connected (FC) network for domain classification over three classes, namely $A$, $C_1$, and $C_2$. The F0 classifier adopts a similar network structure to perform a 10-class task, where the utterance-wide median F0 values are divided into 10 equal intervals covering the range from 100 Hz to 350 Hz. F0 estimation is implemented with the Parselmouth library. When applying MSP to the speaker embedding space, the attribute label is designed as a 2-dimensional vector, in which the first element represents the adult/child domain and the second element represents the read/conversational speaking style. The core AdaIN network follows the same Conv1D architecture as described in [18], with a channel dimension of 512. The variational regularization [31] is not imposed on the content embeddings $\mathbf{zc}$, although it is applied in the original AdaIN network by default. The model is trained with a segment length of 128 and a batch size of 128 for 100k steps, using the Adam optimizer [32] with a learning rate of 0.0005.
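
A hedged sketch of how the utterance-level F0-classifier targets can be derived with Parselmouth; the pitch-extraction settings and the handling of fully unvoiced utterances are assumptions, since only the library and the 10-bin layout over 100-350 Hz are stated above.

import numpy as np
import parselmouth

def median_f0_bin(wav_path, f0_min=100.0, f0_max=350.0, n_bins=10):
    snd = parselmouth.Sound(wav_path)
    pitch = snd.to_pitch()                      # default Praat pitch settings
    f0 = pitch.selected_array['frequency']
    f0 = f0[f0 > 0]                             # drop unvoiced frames
    med = float(np.median(f0)) if f0.size else f0_min
    # Equal-width bins over [100, 350) Hz; out-of-range medians are clipped.
    bin_width = (f0_max - f0_min) / n_bins
    bin_idx = int((np.clip(med, f0_min, f0_max - 1e-6) - f0_min) / bin_width)
    return med, bin_idx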

In the conversion stage, the average of the speaker embeddings over the $C_1$ development set (denoted as $\overline{\mathbf{zs}}_{C_1}$) is computed to represent the target child domain. The adult-to-child feature conversion is performed by replacing the original speaker embedding with $\overline{\mathbf{zs}}_{C_1}$. The decoder output is de-normalized to obtain the converted spectrogram.

Apart from the disentanglement-based AE framework, other conversion methods are evaluated in our experiments. The MaskCycleGAN network [33] is used to perform direct domain transformation from $A$ to $C_1$. Two generators and two discriminators are trained with 20k utterances from each domain. In terms of non-DNN methods, F0-based feature normalization conducts formant modification by assuming a linear relation between F0 and formants on the Mel scale [26]. The target value of the normalized F0 is empirically set to 270 Hz. In addition, statistic matching in the spectral space (denoted as Stats) is evaluated [34], in which the features of set $A$ are first normalized by their mean and standard deviation (std), and then de-normalized by the mean and std of the $C_1$ set. Correlation Alignment (Coral) [35] is also evaluated to minimize the domain shift by aligning the second-order statistics of the $A$ and $C_1$ distributions. The only difference from the Stats method is that Coral uses the covariance instead of the std.
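
The two statistic-matching baselines can be sketched as follows, assuming the features of each domain are stacked into (num_frames, 80) arrays; the eigenvalue floor in the Coral variant is a numerical-stability assumption.

import numpy as np

def stats_match(feats_A, feats_C1):
    # Channel-wise: normalize by A's mean/std, then de-normalize by C1's mean/std.
    z = (feats_A - feats_A.mean(0)) / (feats_A.std(0) + 1e-8)
    return z * feats_C1.std(0) + feats_C1.mean(0)

def _sym_mat_power(cov, p):
    vals, vecs = np.linalg.eigh(cov)
    return vecs @ np.diag(np.maximum(vals, 1e-8) ** p) @ vecs.T

def coral_match(feats_A, feats_C1):
    # Whiten with A's covariance, then re-color with C1's covariance [35].
    cov_A = np.cov(feats_A, rowvar=False)
    cov_C = np.cov(feats_C1, rowvar=False)
    centered = feats_A - feats_A.mean(0)
    aligned = centered @ _sym_mat_power(cov_A, -0.5) @ _sym_mat_power(cov_C, 0.5)
    return aligned + feats_C1.mean(0)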

3.3 End-to-end ASR

The speech recognition model used in our experiments adopts the joint CTC-attention architecture [3]. It consists of three components: a shared encoder, an attention decoder, and a CTC loss layer. The encoder comprises 12 Conformer layers and the decoder has 6 Transformer layers [36]. The input feature of the ASR model is the 80-dimensional log Mel spectrogram, either the original or the converted one. The ASR model is trained to minimize the weighted sum of the attention decoder loss and the CTC loss, where the CTC loss weight is empirically set to 0.3. The maximum number of epochs is set to 150 for model convergence on the 60-hour training data. SpecAugment [37] and GMVN are applied by default. The language model is disabled unless stated otherwise. The ASR experiments are conducted using the ESPnet toolkit [38].
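
A minimal sketch of the training objective, assuming the common convention of weighting the attention loss by (1 - ctc_weight); the tensor shapes and the blank index are illustrative, and the actual system is implemented with ESPnet [38].

import torch.nn.functional as F

def joint_ctc_attention_loss(ctc_log_probs, targets, input_lens, target_lens,
                             att_logits, att_targets, ctc_weight=0.3):
    # ctc_log_probs: (T, B, vocab); att_logits: (B, L, vocab); att_targets: (B, L)
    loss_ctc = F.ctc_loss(ctc_log_probs, targets, input_lens, target_lens,
                          blank=0, zero_infinity=True)
    loss_att = F.cross_entropy(att_logits.transpose(1, 2), att_targets,
                               ignore_index=-1)
    return ctc_weight * loss_ctc + (1.0 - ctc_weight) * loss_att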

Table 2: The WER (%) results on child test speech using ASR models trained on different converted acoustic features and their unconverted counterparts.

Without Conversion (train set: A)
  Model         Test: C1   Test: C2
  Baseline      29.0       76.6
  + LM (C)      25.8       71.7

With Conversion (train set: A2C1)
  Model         Test: C12C1   Test: C22C1
  DAE           28.5          75.1
  CycleGAN      31.7          \
  F0-norm       28.3          74.6
  Stats         30.3          76.3
  Coral         30.3          77.1

4 Results and Analysis

4.1 Comparison of different conversion methods

The word error rate (WER) results on the child test sets are given in Table 2. The baseline model trained on the original $A$ set attains 8.9% WER on the adult test set. On $C_1$ and $C_2$, the baseline model suffers from a large mismatch even with a language model (LM). The ASR models trained with converted acoustic features are compared against this baseline. Five feature conversion methods are evaluated in our experiments. The DAE-based conversion model (DAE for short) achieves 0.5% absolute recognition improvement on the $C_1$ test set, and 1.5% on the $C_2$ test set. The best recognition performance is attained by F0 normalization (F0-norm). Not all of the conversion methods lead to a performance gain, e.g., the CycleGAN. It is nevertheless surprising to observe that statistic matching has no positive effect and even causes degradation. The mean statistics of the 80-channel log Mel spectrograms are plotted in Figure 3, in which the Stats and Coral curves overlap with that of $C_1$. A reasonable upward scaling of the formants is noted in F0-norm compared to the original $A$.

To investigate the efficacies of the different network components in the DAE conversion model, an ablation study is carried out, as shown in Table 3. The use of the domain critic and the F0 classifier is denoted by the symbols DAT and F0_clf, respectively. Having DAT on the content encoder is important for effective disentanglement. The role of MSP seems not to be as useful as that of F0_clf. Imposing variational regularization on the content encoder does not work. Besides, an experiment with the vanilla DAE shows that an additional gain (29.1% to 28.8%) can be attained by training for more steps (200k in this case). Only limited performance improvements are noted in all cases, which may suggest that the amount of data for training the DAE is insufficient.

Table 3: The ablation study on the DAE-based framework. The backslash \ represents removing that component(s).

Conversion model configuration          WER (%): C12C1   WER (%): C22C1
DAE-based framework
  All                                   28.5             75.1
  \ DAT                                 28.9             75.3
  \ F0_clf                              28.9             75.2
  \ MSP                                 28.6             75.3
  \ MSP & F0_clf                        28.7             75.7
  \ DAT & MSP                           29.0             75.7
  \ DAT & F0_clf                        29.0             75.2
  \ DAT & MSP & F0_clf (vanilla)        29.1             75.8
vanilla DAE + variational
  KL weight: 1.0                        30.2             76.7
  KL weight: 0.1                        29.2             76.4
  KL weight: 0.01                       29.0             76.1
Figure 3: The mean statistics of different acoustic features (converted $A2C_1$ vs. original $A$ & $C_1$).

4.2 Evaluation of converted acoustic features

Since paired speech data are not available, i.e., parallel utterances of the same content do not exist in the adult and child domains, spectral distance measures like the Mel cepstral distortion (MCD) are not applicable. Instead, an adult-child classification model is trained to distinguish the three domains, i.e., $A$, $C_1$ and $C_2$. We hypothesize that converted features come from a high-quality $A2C_1$ conversion process if they are classified into the $C_1$ domain. The percentages of the different types of converted features classified as $C_1$ are shown in Table 4. The CycleGAN appears to perform very well, with 100% of its converted features classified as $C_1$. However, the ASR model trained with these features shows performance degradation on test speech from $C_1$. In general, the DNN-based conversion methods are able to generate a high percentage of features classified as $C_1$. This may be related to the robustness issue of deep classification models. The Pearson correlation coefficient between the WERs and the $C_1$ classification percentages is only 0.07.
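
For reference, the weak correlation reported above can be reproduced from Table 4 with SciPy, assuming the coefficient is computed over the five conversion methods only (the unconverted baseline excluded):

from scipy.stats import pearsonr

# C12C1 WERs and percentages classified as C1 (Table 4, conversion rows only).
wer_c1_2c1 = [28.5, 31.7, 28.3, 30.3, 30.3]   # DAE, CycleGAN, F0-norm, Stats, Coral
pct_as_c1  = [95.7, 100.0, 69.1, 57.5, 51.7]
r, _ = pearsonr(wer_c1_2c1, pct_as_c1)        # ~0.07: almost no correlation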

In view of the distinctive F0 levels of adult and child speech, the utterance-wide median F0 values of the two domains are estimated. The F0 distributions are visualized in Figure 4 using 100 utterances from each type of acoustic feature, including the converted and the unconverted ones. The ASR is expected to perform better if the F0 distribution of the converted features is close to that of $C_1$. The 1D Wasserstein distance [39] is adopted to measure the discrepancy between two F0 distributions; the values are listed in the last column of Table 4. The Pearson correlation coefficient with the WERs is 0.83.
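
A minimal sketch of the two measurements in this subsection: the 1D Wasserstein distance between the utterance-level median F0 values of a converted set and of C1, and the correlation of the Table 4 F0 distances with the C12C1 WERs (again computed over the five conversion methods, an assumption about how the coefficient was obtained).

import numpy as np
from scipy.stats import wasserstein_distance, pearsonr

def f0_distribution_distance(median_f0_converted, median_f0_c1):
    # Both inputs are 1D arrays of utterance-level median F0 values in Hz.
    return wasserstein_distance(np.asarray(median_f0_converted),
                                np.asarray(median_f0_c1))

wer_c1_2c1 = [28.5, 31.7, 28.3, 30.3, 30.3]   # DAE, CycleGAN, F0-norm, Stats, Coral
f0_dist    = [2.8, 126.9, 5.5, 19.7, 19.7]
r, _ = pearsonr(wer_c1_2c1, f0_dist)          # ~0.83: strong correlation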

Table 4: The objective evaluation of the converted features.

Conversion Model   WER (%) of C12C1   Classified as C1 (%)   F0 distance to C1
None               29.0               0.0                    41.8
DAE                28.5               95.7                   2.8
CycleGAN           31.7               100.0                  126.9
F0-norm            28.3               69.1                   5.5
Stats              30.3               57.5                   19.7
Coral              30.3               51.7                   19.7
Figure 4: The F0 distributions of different acoustic features (converted $A2C_1$ vs. original $A$ & $C_1$).

5 Conclusion

In this paper, we compare the efficacies of different conversion methods for adult-to-child conversion in the acoustic feature space. The DAE-based conversion framework is investigated in detail under various settings, in which DAT and F0-guided training are shown to be useful. In addition, using an adult-child deep classification model to judge the quality of conversion is found to be unreliable, whereas the distance between the F0 distribution of the converted feature set and that of the target child domain shows a high correlation with the WER performance.

References

  • [1] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates et al., “Deep speech: Scaling up end-to-end speech recognition,” arXiv preprint arXiv:1412.5567, 2014.
  • [2] Y. Zhang, M. Pezeshki, P. Brakel, S. Zhang, C. Laurent, Y. Bengio, and A. Courville, “Towards end-to-end speech recognition with deep convolutional neural networks,” arXiv preprint arXiv:1701.02720, 2017.
  • [3] S. Kim, T. Hori, and S. Watanabe, “Joint ctc-attention based end-to-end speech recognition using multi-task learning,” in Proc. of ICASSP.   IEEE, 2017, pp. 4835–4839.
  • [4] D. Wang, X. Wang, and S. Lv, “An overview of end-to-end automatic speech recognition,” Symmetry, vol. 11, no. 8, p. 1018, 2019.
  • [5] T. Hori, S. Watanabe, Y. Zhang, and W. Chan, “Advances in joint ctc-attention based end-to-end speech recognition with a deep cnn encoder and rnn-lm,” arXiv preprint arXiv:1706.02737, 2017.
  • [6] A. Zeyer, P. Bahar, K. Irie, R. Schlüter, and H. Ney, “A comparison of transformer and lstm encoder decoder models for asr,” in Proc. of ASRU.   IEEE, 2019.
  • [7] F. Claus, H. Gamboa Rosales, R. Petrick, H.-U. Hain, and R. Hoffmann, “A survey about asr for children,” in Speech and Language Technology in Education, 2013.
  • [8] G. Yeung and A. Alwan, “On the difficulties of automatic speech recognition for kindergarten-aged children,” Interspeech, 2018.
  • [9] P. G. Shivakumar and P. Georgiou, “Transfer learning from adult to children for speech recognition: Evaluation, analysis and recommendations,” Computer speech & language, vol. 63, p. 101077, 2020.
  • [10] S. Shahnawazuddin, A. Kumar, V. Kumar, S. Kumar, and W. Ahmad, “Robust children’s speech recognition in zero resource condition,” Applied Acoustics, vol. 185, p. 108382, 2022.
  • [11] H. Hermansky, “Perceptual linear predictive (plp) analysis of speech,” the Journal of the Acoustical Society of America, vol. 87, no. 4, pp. 1738–1752, 1990.
  • [12] S. Lee, A. Potamianos, and S. Narayanan, “Acoustics of children’s speech: Developmental changes of temporal and spectral parameters,” The Journal of the Acoustical Society of America, vol. 105, no. 3, pp. 1455–1468, 1999.
  • [13] P. G. Shivakumar and S. Narayanan, “End-to-end neural systems for automatic children speech recognition: An empirical study,” Computer Speech & Language, vol. 72, p. 101289, 2022.
  • [14] S. Das, D. Nix, and M. Picheny, “Improvements in children’s speech recognition performance,” in Proc. of ICASSP, vol. 1.   IEEE, 1998, pp. 433–436.
  • [15] A. Potamianos, S. Narayanan, and S. Lee, “Automatic speech recognition for children,” in Fifth European Conference on Speech Communication and Technology, 1997.
  • [16] Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation by backpropagation,” in International conference on machine learning.   PMLR, 2015, pp. 1180–1189.
  • [17] S. Sun, C.-F. Yeh, M.-Y. Hwang, M. Ostendorf, and L. Xie, “Domain adversarial training for accented speech recognition,” in Proc. of ICASSP.   IEEE, 2018, pp. 4854–4858.
  • [18] J.-c. Chou, C.-c. Yeh, and H.-y. Lee, “One-shot voice conversion by separating speaker and content representations with instance normalization,” arXiv preprint arXiv:1904.05742, 2019.
  • [19] S. Yuan, P. Cheng, R. Zhang, W. Hao, Z. Gan, and L. Carin, “Improving zero-shot voice style transfer via disentangled representation learning,” arXiv preprint arXiv:2103.09420, 2021.
  • [20] Y. Li and S. Mandt, “Disentangled sequential autoencoder,” arXiv preprint arXiv:1803.02991, 2018.
  • [21] K. Qian, Y. Zhang, S. Chang, X. Yang, and M. Hasegawa-Johnson, “Autovc: Zero-shot voice style transfer with only autoencoder loss,” in International Conference on Machine Learning.   PMLR, 2019, pp. 5210–5219.
  • [22] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proc. of ICCV, 2017, pp. 2223–2232.
  • [23] T. Kaneko and H. Kameoka, “Cyclegan-vc: Non-parallel voice conversion using cycle-consistent adversarial networks,” in 26th European Signal Processing Conference (EUSIPCO).   IEEE, 2018, pp. 2100–2104.
  • [24] T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, “Cyclegan-vc2: Improved cyclegan-based non-parallel voice conversion,” in Proc. of ICASSP.   IEEE, 2019, pp. 6820–6824.
  • [25] L. Prananta, B. M. Halpern, S. Feng, and O. Scharenborg, “The effectiveness of time stretching for enhancing dysarthric speech for improved dysarthric speech recognition,” arXiv preprint arXiv:2201.04908, 2022.
  • [26] G. Yeung, R. Fan, and A. Alwan, “Fundamental frequency feature warping for frequency normalization and data augmentation in child automatic speech recognition,” Speech Communication, vol. 135, pp. 1–10, 2021.
  • [27] X. Huang and S. Belongie, “Arbitrary style transfer in real-time with adaptive instance normalization,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 1501–1510.
  • [28] X. Li, C. Lin, R. Li, C. Wang, and F. Guerin, “Latent space factorisation and manipulation via matrix subspace projection,” in International Conference on Machine Learning.   PMLR, 2020, pp. 5916–5926.
  • [29] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, “Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline,” in Proc. of O-COCOSDA.   IEEE, 2017, pp. 1–5.
  • [30] F. Yu, Z. Yao, X. Wang, K. An, L. Xie, Z. Ou, B. Liu, X. Li, and G. Miao, “The slt 2021 children speech recognition challenge: Open datasets, rules and baselines,” in IEEE Spoken Language Technology Workshop (SLT).   IEEE, 2021, pp. 1117–1123.
  • [31] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
  • [32] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [33] T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, “Maskcyclegan-vc: Learning non-parallel voice conversion with filling in frames,” in Proc. of ICASSP.   IEEE, 2021, pp. 5919–5923.
  • [34] A. I. Mezza, E. A. Habets, M. Müller, and A. Sarti, “Unsupervised domain adaptation for acoustic scene classification using band-wise statistics matching,” in 2020 28th European Signal Processing Conference (EUSIPCO).   IEEE, 2021, pp. 11–15.
  • [35] B. Sun, J. Feng, and K. Saenko, “Correlation alignment for unsupervised domain adaptation,” in Domain Adaptation in Computer Vision Applications.   Springer, 2017, pp. 153–171.
  • [36] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [37] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, “Specaugment: A simple data augmentation method for automatic speech recognition,” arXiv preprint arXiv:1904.08779, 2019.
  • [38] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. E. Y. Soplin, J. Heymann, M. Wiesner, N. Chen et al., “Espnet: End-to-end speech processing toolkit,” arXiv preprint arXiv:1804.00015, 2018.
  • [39] J. Altschuler, J. Niles-Weed, and P. Rigollet, “Near-linear time approximation algorithms for optimal transport via sinkhorn iteration,” Advances in neural information processing systems, vol. 30, 2017.