An Investigation on Applying Acoustic Feature Conversion
to ASR of Adult and Child Speech
Abstract
The performance of child speech recognition is generally less satisfactory than that of adult speech, due to the limited amount of training data. Significant performance degradation is expected when an automatic speech recognition (ASR) system trained on adult speech is applied directly to child speech, as a result of domain mismatch. The present study focuses on adult-to-child acoustic feature conversion to alleviate this mismatch. Different acoustic feature conversion approaches, including deep neural network based and signal processing based ones, are investigated and compared under a fair experimental setting, in which converted acoustic features derived from the same amount of labeled adult speech are used to train the ASR models from scratch. Experimental results reveal that not all of the conversion methods lead to ASR performance gains. Specifically, statistic matching, a classic unsupervised domain adaptation method, is not effective. A disentanglement-based auto-encoder (DAE) conversion framework is found to be useful, and the approach of F0 normalization achieves the best performance. It is noted that the F0 distribution of the converted features is an important attribute reflecting conversion quality, whereas using an adult-child deep classification model to make this judgment is shown to be inappropriate.
Index Terms: child speech recognition, acoustic feature conversion, unsupervised domain adaptation
1 Introduction
Automatic speech recognition (ASR) is the technology of converting a speech signal into text or an equivalent linguistic representation. Deep neural network (DNN) based acoustic models and language models have greatly accelerated the advancement of ASR [1, 2, 3, 4, 5, 6], making speech-enabled human-computer interaction feasible and widely accessible. However, the performance of ASR systems on child speech is generally less satisfactory, falling significantly behind state-of-the-art systems for adult speech [7, 8]. This is largely due to the difficulty of collecting sufficient and diverse child speech data. For ASR research, it is relatively easy to acquire databases with hundreds of hours of adult speech, while child speech databases are much smaller, e.g., tens of hours, or even non-existent for many less-resourced languages. ASR models trained on a large amount of adult speech data can be applied directly to decode child speech utterances, but a drastic degradation of recognition performance is expected because of the mismatch between training and test data in many aspects [9, 10].
Child speech has been found to exhibit large inter-speaker acoustic variability, which poses a number of modeling challenges [11, 12, 13]. Having shorter vocal tracts than adults [14], children produce speech with higher fundamental frequency (F0) and overall higher formant frequencies. The developing articulators of children lead to a slower and less stable speaking rate [15]. Children also tend to commit more pronunciation and grammatical errors during the process of language acquisition [9]. These characteristics all contribute to the acoustic and linguistic mismatch between adult ASR models and child speech. The present study focuses mainly on the acoustic mismatch.
Our goal is to develop an ASR system that performs optimally on child speech. We consider a zero-resourced scenario for child speech, i.e., only labeled adult speech and unlabeled child speech data are available. Building an acoustic model by domain adversarial training (DAT) appears to be a straightforward approach in this scenario [16, 17]. Here we take another perspective: reducing the mismatch by transforming the input features, e.g., the log Mel spectrogram. We aim to convert adult speech features to resemble child speech features, so that they carry the acoustic characteristics of child speech. An ASR model trained on the converted features is expected to better match child speech acoustically and hence to achieve better recognition performance.
There are many choices of conversion methods. In particular, the DNN-based conversion models have been prevalent in recent years. In this study, disentanglement-based auto-encoder (DAE) is adopted as a basic framework for investigation [18, 19, 20, 21]. This framework considers two coupled factors of variation in speech, namely linguistic and para-linguistic factors. The linguistic factor refers to the speech content and the para-linguistic factor covers all content-irrelevant information, including speaker identity, emotion, prosody, and speaking style. We assume that the information about the difference between adult and child speech is encoded as part of the para-linguistic factor and name it as a general speaker factor. Ideally, acoustic feature conversion between adult and child speech can be achieved by modifying this speaker factor.
Apart from the DAE, the cycle-consistent generative adversarial network (CycleGAN) is a popular approach to domain transfer. It was originally proposed for image translation in computer vision [22]. Cycle-consistency and adversarial losses are used to train the CycleGAN without requiring paired data. It has been applied to many unsupervised domain adaptation tasks, including voice conversion [23, 24, 25]. In addition to these DNN-based conversion models, traditional signal processing methods, e.g., formant modification [26] and time-scale modification [10], can also be applied to acoustic feature conversion.
In this paper, the disentanglement-based AE framework is investigated and compared with other acoustic feature conversion approaches under a fair setting, i.e., the ASR model is trained with converted features derived from the same amount of labeled adult speech. The efficacies of the component modules in the DAE framework are investigated through an ablation study.
The DAE based conversion framework will be illustrated in Section 2. Section 3 describes the experimental setup of the acoustic feature conversion and ASR model training. Section 4 gives the results and the work is concluded in Section 5.

2 Methods
2.1 Workflow of experimental process
The overall workflow of the acoustic feature conversion process includes two steps, as illustrated in Figure 1. The first step is to perform adult-to-child feature transformation via a conversion model. The conversion approaches are categorized into two streams: signal processing based methods, e.g., formant modification (FM) and time-scale modification (TSM), and DNN based methods, e.g., the disentanglement-based AE and CycleGAN. After feature conversion, the second step is to train an ASR model using the converted speech feature set. The improvement in recognition performance on child test speech is evaluated against systems trained without converted features.
2.2 AdaIN voice conversion
Voice conversion (VC) aims to modify content-irrelevant information while preserving the linguistic content. The same idea is adopted for acoustic feature conversion, except that the vocoder for converting the spectrogram back to a waveform is not needed. The AdaIN network in [18] is a disentanglement-based auto-encoder that performs one-shot VC by separating content and speaker embeddings with instance normalization (IN). Three modules constitute the AdaIN network: the content encoder $E_c$, the speaker encoder $E_s$, and the decoder $D$. The acoustic feature $X$ is fed into $E_c$ and $E_s$ as input. $E_c$ generates a sequence of content embeddings $Z_c$, while $E_s$ produces the speaker representation $Z_s$. The decoder outputs $\hat{X} = D(Z_c, Z_s)$, which is intended to reconstruct $X$ from $Z_c$ and $Z_s$. IN is applied in the content encoder to remove speaker-related information, while adaptive IN [27] is used to provide the global speaker information encoded by $E_s$ to the decoder. The whole network is trained to minimize the reconstruction loss in an unsupervised manner. Conversion is performed by feeding $Z_c$ of the source speaker and $Z_s$ of the target speaker to the decoder, i.e., $\hat{X}_{src \to tgt} = D(Z_c^{src}, Z_s^{tgt})$.
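The factorization above can be sketched in PyTorch as follows. This is a minimal illustration of the encoder/decoder roles and of where IN and AdaIN act; the layer counts, kernel sizes, and module names are placeholders, not the exact architecture of [18].

```python
# Minimal sketch of the AdaIN-VC factorization (illustrative, not the exact model in [18]).
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    def __init__(self, n_mels=80, channels=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, channels, kernel_size=5, padding=2),
            nn.InstanceNorm1d(channels),   # IN strips global (speaker-related) statistics
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=5, padding=2),
            nn.InstanceNorm1d(channels),
            nn.ReLU(),
        )

    def forward(self, mel):                # mel: (B, n_mels, T)
        return self.conv(mel)              # content embeddings Z_c: (B, C, T)

class SpeakerEncoder(nn.Module):
    def __init__(self, n_mels=80, channels=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, channels, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=5, padding=2), nn.ReLU(),
        )

    def forward(self, mel):
        return self.conv(mel).mean(dim=2)  # utterance-level speaker embedding Z_s: (B, C)

class Decoder(nn.Module):
    """Decoder conditioned on Z_s via adaptive instance normalization (AdaIN)."""
    def __init__(self, n_mels=80, channels=512):
        super().__init__()
        self.conv_in = nn.Conv1d(channels, channels, kernel_size=5, padding=2)
        self.inorm = nn.InstanceNorm1d(channels, affine=False)
        self.affine = nn.Linear(channels, 2 * channels)  # AdaIN scale and bias from Z_s
        self.conv_out = nn.Conv1d(channels, n_mels, kernel_size=5, padding=2)

    def forward(self, z_c, z_s):
        h = self.inorm(self.conv_in(z_c))
        gamma, beta = self.affine(z_s).chunk(2, dim=1)
        h = gamma.unsqueeze(-1) * h + beta.unsqueeze(-1)  # re-inject speaker statistics
        return self.conv_out(h)                           # reconstructed / converted mel

# Training minimizes a reconstruction loss on X_hat = D(E_c(X), E_s(X));
# conversion feeds the target speaker's Z_s: X_hat = D(E_c(X_src), E_s(X_tgt)).
```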

2.3 DAE acoustic feature conversion
As shown in Figure 2, the AdaIN network (in the dashed block) is the core module of the DAE framework, which aims to perform adult-to-child conversion in the acoustic feature space. The subscripts $a$ and $c$ indicate whether a variable belongs to the adult or the child domain. The solid-line arrows illustrate the reconstruction process in the training phase, while the dashed arrows refer to the conversion stage. $\bar{Z}_s^{c}$ is the representation of the child domain, calculated by averaging the speaker embeddings of the child speech utterances. $X_{a \to c}$ denotes the converted acoustic feature, which is generated by replacing the adult speaker embedding $Z_s^{a}$ with $\bar{Z}_s^{c}$. When evaluating ASR models trained on $X_{a \to c}$, to overcome the large mismatch between real and generated speech features, the child test speech is also passed through the conversion, and the corresponding generated feature is denoted as $X_{c \to c}$. The whole conversion process can be expressed as:

$$X_{a \to c} = D(E_c(X_a), \bar{Z}_s^{c}), \qquad X_{c \to c} = D(E_c(X_c), \bar{Z}_s^{c}) \tag{1}$$
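A short sketch of Eq. (1), reusing the illustrative modules from the previous example; the function names and the choice of utterances for the averaging are assumptions for illustration.

```python
# Sketch of the adult-to-child conversion in Eq. (1): replace the source speaker
# embedding with the averaged child-domain speaker embedding.
import torch

@torch.no_grad()
def mean_child_embedding(speaker_encoder, child_mels):
    """Average speaker embeddings over a set of child utterances (Z_s^c-bar)."""
    embs = [speaker_encoder(mel.unsqueeze(0)) for mel in child_mels]  # each (1, C)
    return torch.cat(embs, dim=0).mean(dim=0, keepdim=True)           # (1, C)

@torch.no_grad()
def convert_to_child(content_encoder, decoder, mel, z_s_child_mean):
    """X_{a->c} = D(E_c(X_a), Z_s^c-bar); the same call yields X_{c->c} for child test speech."""
    z_c = content_encoder(mel.unsqueeze(0))            # (1, C, T)
    return decoder(z_c, z_s_child_mean).squeeze(0)     # converted log Mel: (n_mels, T)
```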
The original AdaIN network is trained in a fully unsupervised manner, i.e., only speech data are required. Nevertheless, the domain class label and the median F0 value are used here to facilitate a better conversion. A domain-critic module is built on top of the content encoder $E_c$ and adversarially trained to force the content embeddings to be domain-invariant. Since the F0 distribution is found to be an important attribute for discriminating adult and child speech, the median F0 of each utterance is estimated and used to train an F0 classifier. Moreover, matrix subspace projection (MSP) [28] is applied to the speaker embedding space. It transforms $Z_s$ into a low-dimensional attribute space, adult versus child domain in this case, which is expected to make $Z_s$ carry more discriminative domain information.
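One common way to realize such an adversarial domain critic is a gradient-reversal layer [16] applied to the content embeddings. The sketch below follows that recipe; the layer sizes, the time-averaging of the content embeddings, and the class names are assumptions rather than the paper's exact implementation.

```python
# Sketch of a gradient-reversal domain critic over the content embeddings (DAT recipe [16]).
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.clone()

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None   # flip the gradient so E_c unlearns domain cues

class DomainCritic(nn.Module):
    """Three-layer FC classifier over {A, C1, C2}, fed with time-averaged content embeddings."""
    def __init__(self, channels=512, n_domains=3, lam=1.0):
        super().__init__()
        self.lam = lam
        self.net = nn.Sequential(
            nn.Linear(channels, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, n_domains),
        )

    def forward(self, z_c):                # z_c: (B, C, T)
        h = GradReverse.apply(z_c.mean(dim=2), self.lam)
        return self.net(h)                 # domain logits; trained with cross-entropy
```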
3 Experimental Setup
3.1 Dataset
Acoustic feature conversion is performed to reduce the mismatch between the two speech domains, namely adult speech and child speech. The adult speech data, denoted A, are from a subset of the AISHELL-1 corpus. The child speech data are from the 2021 SLT CSRC (Children Speech Recognition Challenge) dataset. AISHELL-1 [29] is an open-source dataset of Mandarin read speech by adult speakers, intended and widely used for ASR research. The CSRC dataset [30] consists of two parts of child speech with different speaking styles. The first part, denoted C1, contains read speech. The second part, denoted C2, contains conversational speech. The test sets of both C1 and C2 contain 2 hours of speech. The training sets of AISHELL-1 and CSRC are summarized in Table 1.
Data set | A | C1 | C2
---|---|---|---
Duration (hrs) | 60 | 24 | 25
# of Utts | 48,515 | 23,824 | 25,447
# of Speakers | 137 | 742 | 133
Speaker age | 18-60 | 7-11 | 4-11
Speaking style | read | read | conversational
3.2 Feature conversion
80-dimensional log Mel spectrograms are extracted from the raw audio with a 25 ms window length and a 10 ms hop length. All audio data are sampled at 16 kHz. The feature sets of A, C1 and C2 are pooled together and global mean and variance normalization (GMVN) is applied. The disentanglement-based AE network is trained with the normalized features. It is noted that the speech in C2 exhibits a significant mismatch with A in a preliminary ASR experiment, which is related to the difference in speaking style, i.e., read vs conversational speech. Therefore, we consider adult-to-child conversion under the same speaking style. Specifically, C1 is regarded as the target child domain of interest, meaning that there are three conversion types in total: the adult training data and each of the two child test sets are converted.
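The feature extraction can be reproduced roughly as follows. This librosa-based sketch maps the stated parameters (80 Mel bins, 25 ms window, 10 ms hop, 16 kHz) onto library arguments; the exact toolkit and the small numerical constants are assumptions.

```python
# Sketch of 80-dim log Mel extraction plus global mean/variance normalization (GMVN).
import numpy as np
import librosa

def log_mel(wav_path, sr=16000, n_mels=80, win_ms=25, hop_ms=10):
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_mels=n_mels,
        n_fft=int(sr * win_ms / 1000), hop_length=int(sr * hop_ms / 1000),
    )
    return np.log(mel + 1e-10)                      # (n_mels, T)

def gmvn(feature_list):
    """Pool features from A, C1 and C2, then normalize each with the global per-channel stats."""
    pooled = np.concatenate(feature_list, axis=1)   # (n_mels, total_frames)
    mean = pooled.mean(axis=1, keepdims=True)
    std = pooled.std(axis=1, keepdims=True) + 1e-8
    return [(f - mean) / std for f in feature_list], mean, std
```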
The domain-critic module is implemented as a three-layer fully-connected (FC) network for domain classification over three classes, namely A, C1, and C2. The F0 classifier adopts a similar network structure to perform a 10-class task: the utterance-wide median F0 values are divided into 10 equal intervals covering the range from 100 Hz to 350 Hz. F0 estimation is implemented with the Parselmouth library. When applying MSP on the speaker embedding space, the attribute label is designed as a 2-dimensional vector, in which the first element represents the adult/child domain and the second element represents the read/conversational speaking style. The core AdaIN network follows the same Conv1D-layer architecture as described in [18], with a channel dimension of 512. Variational regularization is not imposed on the content embeddings [31], though it is applied in the original AdaIN network by default. The model is trained for 100k steps using the Adam optimizer [32].
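The median-F0 labels for the 10-class F0 classifier can be obtained with Parselmouth roughly as sketched below; the pitch floor/ceiling values and the binning helper are assumptions for illustration.

```python
# Sketch of utterance-wide median F0 estimation (Parselmouth/Praat) and 10-class labeling
# over 100-350 Hz in equal-width bins.
import numpy as np
import parselmouth

def median_f0(wav_path, floor=75.0, ceiling=500.0):
    snd = parselmouth.Sound(wav_path)
    pitch = snd.to_pitch(pitch_floor=floor, pitch_ceiling=ceiling)
    f0 = pitch.selected_array["frequency"]
    f0 = f0[f0 > 0]                                # drop unvoiced frames
    return float(np.median(f0)) if f0.size else 0.0

def f0_class(median, lo=100.0, hi=350.0, n_bins=10):
    idx = int((median - lo) / (hi - lo) * n_bins)
    return int(np.clip(idx, 0, n_bins - 1))        # 10 equal intervals covering 100-350 Hz
```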
In the conversion stage, the average of the speaker embeddings from the development set (denoted as $\bar{Z}_s^{c}$) is computed to represent the target child domain. The adult-to-child speech feature conversion is performed by replacing the original speaker embedding with $\bar{Z}_s^{c}$. The decoder output is then de-normalized to obtain the converted spectrogram.
Apart from the disentanglement-based AE framework, other conversion methods are evaluated in our experiments. The MaskCycleGAN network [33] is utilized to perform direct domain transformation from A to C1. Two generators and two discriminators are trained with 20k utterances for each domain. In terms of non-DNN methods, F0-based feature normalization performs formant modification by assuming a linear relation between F0 and the formants on the Mel scale [26]; the target value of the normalized F0 is empirically set to 270 Hz. In addition, statistic matching in the spectral feature space (denoted as Stats) is evaluated [34], in which the features of the A set are first normalized by their mean and standard deviation (std), and then de-normalized using the mean and std of the C1 set. Correlation alignment (Coral) [35] is also experimented with to minimize the domain shift by aligning the second-order statistics of the A and C1 distributions. The only difference from the Stats method is that Coral uses the covariance instead of the std.
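The Stats method amounts to per-channel standardization toward the C1 statistics; a minimal sketch under that reading of [34] is shown below (function and variable names are illustrative).

```python
# Sketch of band-wise statistic matching (Stats): standardize A features with the A statistics,
# then de-standardize with the C1 statistics, per Mel channel.
# Coral differs only in using (co)variance alignment instead of the per-channel std.
import numpy as np

def stats_match(feat_a, mean_a, std_a, mean_c1, std_c1, eps=1e-8):
    """feat_a: (n_mels, T); means/stds: (n_mels, 1) computed over each domain's training set."""
    normalized = (feat_a - mean_a) / (std_a + eps)
    return normalized * std_c1 + mean_c1   # converted features adopt the first/second-order stats of C1
```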
3.3 End-to-end ASR
The speech recognition model used in our experiments adopts the joint CTC-attention architecture [3]. It consists of three components, namely the shared encoder, the attention decoder, and the CTC loss layer. The encoder comprises 12 Conformer layers and the decoder has 6 Transformer layers [36]. The input feature of the ASR model is the 80-dimensional log Mel spectrogram, either the original or the converted one. The ASR model is trained to minimize the weighted sum of the attention decoder loss and the CTC loss, where the CTC loss weight is empirically set to 0.3. The maximum number of epochs is set to 150 for model convergence on the 60-hour training data. SpecAugment [37] and GMVN are applied by default. The language model is disabled unless stated otherwise. The ASR experiments are conducted with the ESPnet toolkit [38].
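Following the joint CTC-attention formulation [3], the training objective with the stated CTC weight can be written as

$$\mathcal{L}_{\text{ASR}} = \lambda\,\mathcal{L}_{\text{CTC}} + (1-\lambda)\,\mathcal{L}_{\text{att}}, \qquad \lambda = 0.3 .$$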
Type | Model | Train set | WER (%) on C1 test | WER (%) on C2 test
---|---|---|---|---
Without Conversion | Baseline | A (original) | 29.0 | 76.6
Without Conversion | + LM (C) | A (original) | 25.8 | 71.7
With Conversion | DAE | A (converted) | 28.5 | 75.1
With Conversion | CycleGAN | A (converted) | 31.7 | \
With Conversion | F0-norm | A (converted) | 28.3 | 74.6
With Conversion | Stats | A (converted) | 30.3 | 76.3
With Conversion | Coral | A (converted) | 30.3 | 77.1
4 Results and Analysis
4.1 Comparison of different conversion methods
The word error rate (WER) results on the child test sets are given in Table 2. The baseline model is trained on the original A set and performs well on the adult test set. On C1 and C2, however, its performance suffers from the large domain mismatch, even when a language model (LM) is used. The ASR models trained with converted acoustic features are compared against this baseline. Five feature conversion methods are evaluated in our experiments. The DAE-based conversion model (DAE for short) achieves 0.5% absolute recognition improvement on the C1 test set and 1.5% on the C2 test set. The best recognition performance is attained by F0 normalization (F0-norm). Not all of the conversion methods lead to a performance gain, e.g., CycleGAN. It is also surprising to observe that statistic matching brings no benefit and even causes degradation. The mean statistics of the 80-channel log Mel spectrogram are plotted in Figure 3, in which the Stats and Coral curves overlap with that of C1. A reasonable up-scaling of the formants is observed for F0-norm compared with the original A.
To investigate the efficacies of the different network components in the DAE conversion model, an ablation study is carried out, as shown in Table 3. The domain critic and the F0 classifier are denoted by DAT and F0_clf, respectively. Applying DAT on the content encoder is important for effective disentanglement. The role of MSP appears to be less useful than that of F0_clf. Imposing variational regularization on the content encoder does not help. Besides, an experiment with the vanilla DAE shows that an additional benefit can be attained by training for more steps (200k in this case). Only limited performance improvements are observed in all cases, which may suggest that the amount of data for training the DAE is insufficient.
Conversion Model | Configuration | WER (%) on C1 test | WER (%) on C2 test
---|---|---|---
DAE-based framework | All | 28.5 | 75.1
DAE-based framework | \ DAT | 28.9 | 75.3
DAE-based framework | \ F0_clf | 28.9 | 75.2
DAE-based framework | \ MSP | 28.6 | 75.3
DAE-based framework | \ MSP & F0_clf | 28.7 | 75.7
DAE-based framework | \ DAT & MSP | 29.0 | 75.7
DAE-based framework | \ DAT & F0_clf | 29.0 | 75.2
DAE-based framework | \ DAT & MSP & F0_clf (vanilla DAE) | 29.1 | 75.8
vanilla DAE + variational | KL weight: 1.0 | 30.2 | 76.7
vanilla DAE + variational | KL weight: 0.1 | 29.2 | 76.4
vanilla DAE + variational | KL weight: 0.01 | 29.0 | 76.1

Figure 3: Mean statistics of the 80-channel log Mel spectrogram (converted vs original A & C1).
4.2 Evaluation of converted acoustic features
Since paired speech data are not available, i.e., there are no parallel utterances with the same content from the adult and child domains, spectral distance measures like the Mel cepstral distortion (MCD) are not applicable. Instead, an adult-child classification model is trained to distinguish the three domains, i.e., A, C1 and C2. We hypothesize that the converted features come from a high-quality conversion process if they are classified into the C1 domain. The percentages of the different types of converted features classified as C1 are shown in Table 4. CycleGAN appears to perform very well, with 100% of its converted features classified as C1. However, the ASR model trained with these features shows performance degradation on the C1 test speech. In general, the DNN-based conversion methods generate a high percentage of features classified as C1, which may be related to the robustness issue of deep classification models. The Pearson correlation coefficient between the WERs and the classification percentages is weak.
In view of the distinctive F0 levels in adult and child speech, the utterance-wide median F0 values of the two domains are estimated. The F0 distributions of the different types of acoustic features, including the converted and the unconverted ones, are visualized in Figure 4 using 100 utterances per type. The ASR model is expected to perform better if the F0 distribution of the converted features is close to that of C1. The 1D Wasserstein distance is adopted to measure the discrepancy between two F0 distributions [39]; the distances are listed in the last column of Table 4. Their Pearson correlation coefficient with the WERs is high.
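Both measures can be computed directly with SciPy. The sketch below assumes lists of per-utterance median F0 values for a converted set and for the C1 reference, plus one WER and one score per conversion method; the function names are illustrative.

```python
# Sketch: 1D Wasserstein distance between F0 distributions, and Pearson correlation
# between per-method WERs and a per-method score (classification % or F0 distance).
from scipy.stats import wasserstein_distance, pearsonr

def f0_distance(f0_converted, f0_c1):
    """f0_*: lists of utterance-wide median F0 values (Hz), e.g., 100 utterances per set."""
    return wasserstein_distance(f0_converted, f0_c1)

def correlation_with_wer(wers, scores):
    """wers, scores: one value per conversion method (rows of Table 4)."""
    r, p_value = pearsonr(wers, scores)
    return r, p_value
```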
Conversion method | WER (%) on C1 test | Classified as C1 (%) | Wasserstein distance to C1 F0 distribution
---|---|---|---
None | 29.0 | 0.0 | 41.8
DAE | 28.5 | 95.7 | 2.8
CycleGAN | 31.7 | 100.0 | 126.9
F0-norm | 28.3 | 69.1 | 5.5
Stats | 30.3 | 57.5 | 19.7
Coral | 30.3 | 51.7 | 19.7

Figure 4: Distributions of utterance-wide median F0 values (converted vs original A & C1).
5 Conclusion
In this paper, we compare the efficacies of different conversion methods for performing adult-to-child conversion in the acoustic feature space. The DAE-based conversion framework is investigated in detail under various settings, in which the DAT and the F0-guided training are shown to be useful. In addition, using an adult-child deep classification model to judge the quality of conversion is found to be less reliable, whereas the distance between the F0 distribution of the converted feature set and that of the target child domain shows a high correlation with the WER performance.
References
- [1] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates et al., “Deep speech: Scaling up end-to-end speech recognition,” arXiv preprint arXiv:1412.5567, 2014.
- [2] Y. Zhang, M. Pezeshki, P. Brakel, S. Zhang, C. L. Y. Bengio, and A. Courville, “Towards end-to-end speech recognition with deep convolutional neural networks,” arXiv preprint arXiv:1701.02720, 2017.
- [3] S. Kim, T. Hori, and S. Watanabe, “Joint ctc-attention based end-to-end speech recognition using multi-task learning,” in Proc. of ICASSP. IEEE, 2017, pp. 4835–4839.
- [4] D. Wang, X. Wang, and S. Lv, “An overview of end-to-end automatic speech recognition,” Symmetry, vol. 11, no. 8, p. 1018, 2019.
- [5] T. Hori, S. Watanabe, Y. Zhang, and W. Chan, “Advances in joint ctc-attention based end-to-end speech recognition with a deep cnn encoder and rnn-lm,” arXiv preprint arXiv:1706.02737, 2017.
- [6] A. Zeyer, P. Bahar, K. Irie, R. Schlüter, and H. Ney, “A comparison of transformer and lstm encoder decoder models for asr,” in Proc. of ASRU. IEEE, 2019.
- [7] F. Claus, H. Gamboa Rosales, R. Petrick, H.-U. Hain, and R. Hoffmann, “A survey about asr for children,” in Speech and Language Technology in Education, 2013.
- [8] G. Yeung and A. Alwan, “On the difficulties of automatic speech recognition for kindergarten-aged children,” Interspeech, 2018.
- [9] P. G. Shivakumar and P. Georgiou, “Transfer learning from adult to children for speech recognition: Evaluation, analysis and recommendations,” Computer speech & language, vol. 63, p. 101077, 2020.
- [10] S. Shahnawazuddin, A. Kumar, V. Kumar, S. Kumar, and W. Ahmad, “Robust children’s speech recognition in zero resource condition,” Applied Acoustics, vol. 185, p. 108382, 2022.
- [11] H. Hermansky, “Perceptual linear predictive (plp) analysis of speech,” the Journal of the Acoustical Society of America, vol. 87, no. 4, pp. 1738–1752, 1990.
- [12] S. Lee, A. Potamianos, and S. Narayanan, “Acoustics of children’s speech: Developmental changes of temporal and spectral parameters,” The Journal of the Acoustical Society of America, vol. 105, no. 3, pp. 1455–1468, 1999.
- [13] P. G. Shivakumar and S. Narayanan, “End-to-end neural systems for automatic children speech recognition: An empirical study,” Computer Speech & Language, vol. 72, p. 101289, 2022.
- [14] S. Das, D. Nix, and M. Picheny, “Improvements in children’s speech recognition performance,” in Proc. of ICASSP, vol. 1. IEEE, 1998, pp. 433–436.
- [15] A. Potamianos, S. Narayanan, and S. Lee, “Automatic speech recognition for children,” in Fifth European Conference on Speech Communication and Technology, 1997.
- [16] Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation by backpropagation,” in International conference on machine learning. PMLR, 2015, pp. 1180–1189.
- [17] S. Sun, C.-F. Yeh, M.-Y. Hwang, M. Ostendorf, and L. Xie, “Domain adversarial training for accented speech recognition,” in Proc. of ICASSP. IEEE, 2018, pp. 4854–4858.
- [18] J.-c. Chou, C.-c. Yeh, and H.-y. Lee, “One-shot voice conversion by separating speaker and content representations with instance normalization,” arXiv preprint arXiv:1904.05742, 2019.
- [19] S. Yuan, P. Cheng, R. Zhang, W. Hao, Z. Gan, and L. Carin, “Improving zero-shot voice style transfer via disentangled representation learning,” arXiv preprint arXiv:2103.09420, 2021.
- [20] Y. Li and S. Mandt, “Disentangled sequential autoencoder,” arXiv preprint arXiv:1803.02991, 2018.
- [21] K. Qian, Y. Zhang, S. Chang, X. Yang, and M. Hasegawa-Johnson, “Autovc: Zero-shot voice style transfer with only autoencoder loss,” in International Conference on Machine Learning. PMLR, 2019, pp. 5210–5219.
- [22] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proc. of ICCV, 2017, pp. 2223–2232.
- [23] T. Kaneko and H. Kameoka, “Cyclegan-vc: Non-parallel voice conversion using cycle-consistent adversarial networks,” in 26th European Signal Processing Conference (EUSIPCO). IEEE, 2018, pp. 2100–2104.
- [24] T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, “Cyclegan-vc2: Improved cyclegan-based non-parallel voice conversion,” in Proc. of ICASSP. IEEE, 2019, pp. 6820–6824.
- [25] L. Prananta, B. M. Halpern, S. Feng, and O. Scharenborg, “The effectiveness of time stretching for enhancing dysarthric speech for improved dysarthric speech recognition,” arXiv preprint arXiv:2201.04908, 2022.
- [26] G. Yeung, R. Fan, and A. Alwan, “Fundamental frequency feature warping for frequency normalization and data augmentation in child automatic speech recognition,” Speech Communication, vol. 135, pp. 1–10, 2021.
- [27] X. Huang and S. Belongie, “Arbitrary style transfer in real-time with adaptive instance normalization,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 1501–1510.
- [28] X. Li, C. Lin, R. Li, C. Wang, and F. Guerin, “Latent space factorisation and manipulation via matrix subspace projection,” in International Conference on Machine Learning. PMLR, 2020, pp. 5916–5926.
- [29] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, “Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline,” in Proc. of O-COCOSDA. IEEE, 2017, pp. 1–5.
- [30] F. Yu, Z. Yao, X. Wang, K. An, L. Xie, Z. Ou, B. Liu, X. Li, and G. Miao, “The slt 2021 children speech recognition challenge: Open datasets, rules and baselines,” in IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021, pp. 1117–1123.
- [31] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
- [32] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
- [33] T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, “Maskcyclegan-vc: Learning non-parallel voice conversion with filling in frames,” in Proc. of ICASSP. IEEE, 2021, pp. 5919–5923.
- [34] A. I. Mezza, E. A. Habets, M. Müller, and A. Sarti, “Unsupervised domain adaptation for acoustic scene classification using band-wise statistics matching,” in 2020 28th European Signal Processing Conference (EUSIPCO). IEEE, 2021, pp. 11–15.
- [35] B. Sun, J. Feng, and K. Saenko, “Correlation alignment for unsupervised domain adaptation,” in Domain Adaptation in Computer Vision Applications. Springer, 2017, pp. 153–171.
- [36] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
- [37] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, “Specaugment: A simple data augmentation method for automatic speech recognition,” arXiv preprint arXiv:1904.08779, 2019.
- [38] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. E. Y. Soplin, J. Heymann, M. Wiesner, N. Chen et al., “Espnet: End-to-end speech processing toolkit,” arXiv preprint arXiv:1804.00015, 2018.
- [39] J. Altschuler, J. Niles-Weed, and P. Rigollet, “Near-linear time approximation algorithms for optimal transport via sinkhorn iteration,” Advances in neural information processing systems, vol. 30, 2017.