HDMoLE: Mixture of LoRA Experts with Hierarchical Routing and Dynamic Thresholds for Fine-Tuning LLM-based ASR Models
Abstract
Recent advances in integrating large language models (LLMs) with automatic speech recognition (ASR) have achieved remarkable performance in general domains. While supervised fine-tuning (SFT) of all model parameters is often employed to adapt pre-trained LLM-based ASR models to specific domains, it imposes high computational costs and notably degrades their performance in general domains. In this paper, we propose HDMoLE, a novel parameter-efficient multi-domain fine-tuning method that adapts pre-trained LLM-based ASR models to multi-accent domains without catastrophic forgetting. HDMoLE combines low-rank adaptation (LoRA) with the mixture of experts (MoE) through hierarchical routing and dynamic thresholds, and can be applied to any linear layer. Hierarchical routing establishes a clear correspondence between LoRA experts and accent domains and improves cross-domain collaboration among the LoRA experts. Unlike the static Top-K strategy for activating LoRA experts, dynamic thresholds adaptively activate varying numbers of LoRA experts at each MoE layer. Experiments on multi-accent and standard Mandarin datasets demonstrate the efficacy of HDMoLE. Applying HDMoLE to the projector module of an LLM-based ASR model achieves performance comparable to full fine-tuning in the target multi-accent domains while using only 9.6% of the trainable parameters required for full fine-tuning, with minimal degradation in the source general domain.
Index Terms:
Hierarchical routing, dynamic thresholds, HDMoLE
I Introduction
Large language models (LLMs) [1, 2, 3, 4] have garnered widespread attention across various fields due to their exceptional language understanding and generation capabilities. Extensive research has explored the potential of LLMs in various fields, particularly in automatic speech recognition (ASR). Recent developments in combining LLMs with ASR have led to outstanding performance, and the paradigm of augmenting a speech foundation model with an LLM through a projector module has become the prevailing framework for LLM-based ASR [5, 6, 7]. However, most LLM-based ASR models focus solely on general-domain speech recognition and produce numerous errors when confronted with speech under challenging acoustic conditions such as background noise [8] and speaker accents [9]. While supervised fine-tuning (SFT) of all parameters offers a direct approach to adapting pre-trained LLM-based ASR models to specific target acoustic domains, it requires substantial computational resources to retrain large LLM-based ASR models and results in considerable performance degradation in the source general domains [10, 11].
Parameter-efficient fine-tuning strategies [12, 13, 14, 15, 16] adapt large models to specific domains by fine-tuning only a small portion of model parameters, significantly reducing computational costs while minimizing the risk of catastrophic forgetting. Low-rank adaptation (LoRA) [17] stands out among these strategies because it improves adaptability without changing the original model parameters. LoRA employs low-rank decomposition to realize weight updates via smaller matrices, allowing the model to adapt to new domains while keeping the original weight matrices unchanged, thus providing a parameter-efficient method for model adaptation. In practical applications, however, a single LoRA is often insufficient to meet user expectations across domains. The mixture of experts (MoE) [18, 19, 20, 21, 22] is an ensemble method commonly viewed as a collection of sub-networks (experts), each focusing on different domains, with a trainable gating network (router) assigning weights to these experts.
Drawing inspiration from MoE, many researchers treat each LoRA as a domain expert to overcome the challenges that large models encounter in real-world multi-domain scenarios. MOELoRA [23] and MoA [24] employ domain-specific LoRA experts and explicit routing strategies to accommodate diverse domains. SiRA [25] and MixLoRA [26] introduce sparse MoE mechanisms with specialized routing or load-balancing techniques to enhance efficiency while maintaining performance. MoRAL [27] tackles the challenge of adapting LLMs to new domains while enabling them to become efficient lifelong learners. LoRAMoE [11] integrates several LoRA experts through an MoE-style plugin to mitigate world-knowledge forgetting in LLMs during SFT. Despite the valuable insights these studies provide into the fusion of MoE and LoRA, several challenges persist. First, the high coefficient of variation in unconstrained MoE layers reflects that the router consistently assigns larger weights to the same few experts [18]. This imbalanced expert utilization is a typical problem in MoE and indicates that the correspondence between LoRA experts and domains is unclear. Second, the static Top-K expert selection strategy constrains the adaptability of MoE, highlighting the need for more dynamic expert selection strategies to address the complexities of different domains [28].
This paper explores applying the mixture of LoRA experts (MoLE) to pre-trained LLM-based ASR models to improve their capabilities in handling the challenging multi-accent domains. Accents represent deviations from standard pronunciation norms influenced by the speaker’s educational background, geographical region, or native language [29], leading to significant performance degradation in pre-trained LLM-based ASR models. To this end, we propose a novel parameter-efficient fine-tuning method named HDMoLE that adapts pre-trained LLM-based ASR models to multi-accent domains without catastrophic forgetting by leveraging MoLE combined with hierarchical routing and dynamic thresholds. MoLE allows for the parameter-efficient fine-tuning of large models across multiple accent domains, effectively mitigating catastrophic forgetting. Hierarchical routing includes global and local routing, which clarifies the correspondence between LoRA experts and accent domains while improving cross-domain collaboration among LoRA experts by assigning optimal combination weights. Through dynamic thresholds, varying quantities of LoRA experts can be selected in each MoLE layer, with unsuitable experts discarded and higher weights reassigned to the more suitable ones. In summary, the contributions of this paper are as follows:
• To our knowledge, HDMoLE is the first attempt to explore parameter-efficient multi-domain adaptation for pre-trained LLM-based ASR models that can be applied to any linear layer.
• HDMoLE employs hierarchical routing and dynamic thresholds based on MoLE to clarify the correspondence between LoRA experts and accent domains, dynamically select suitable experts, and allocate higher weights to them.
• Extensive experiments demonstrate that HDMoLE achieves character error rate (CER) results in the target multi-accent domains comparable to full fine-tuning while using only 9.6% of the trainable parameters required for full fine-tuning, with minimal degradation in the source general domain.

II Proposed Methods
This section provides a detailed introduction to HDMoLE. Although HDMoLE can be generalized to any linear layer, we apply it to the projector module of a recently released LLM-based ASR model with the Hubert+Baichuan2 structure [7], chosen for its relatively compact size and its alignment of accented speech and text modalities.
II-A Preliminaries
Low-Rank Adaptation. LoRA [17] is an exceptional parameter-efficient fine-tuning method for adapting pre-trained large models to specific domains. It reduces the number of trainable parameters by updating pairs of low-rank decomposition matrices while keeping the original weights unchanged. Specifically, for a given linear layer with weight matrix $W$, LoRA employs two low-rank matrices $A$ and $B$ with rank $r$, where $W \in \mathbb{R}^{d \times k}$, $A \in \mathbb{R}^{r \times k}$, $B \in \mathbb{R}^{d \times r}$, and $r \ll \min(d, k)$. With the application of LoRA, the forward process of the given linear layer can be expressed as follows:
$$y = Wx + b + \frac{\alpha}{r} BAx \quad (1)$$
where the low-rank matrices $A$ and $B$ are trainable while the original weight $W$ and bias $b$ remain unchanged during training. Matrix $A$ is initialized with a random Gaussian distribution, and matrix $B$ starts from zero. Scaling by $\frac{\alpha}{r}$ controls the extent of adjustments to the original weights imposed by LoRA, with $\alpha$ and $r$ representing constants.
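To make Eq. (1) concrete, below is a minimal PyTorch sketch of a LoRA-augmented linear layer, assuming standard LoRA conventions (Gaussian-initialized $A$, zero-initialized $B$, frozen base weights); the class and argument names are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer augmented with a trainable low-rank update (Eq. 1)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():            # W and b stay frozen
            p.requires_grad_(False)
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(rank, k) * 0.01)   # Gaussian init
        self.B = nn.Parameter(torch.zeros(d, rank))          # zero init
        self.scaling = alpha / rank                          # alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = W x + b + (alpha / r) * B A x
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)
```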
Mixture of Experts. The MoE [18, 19, 20] framework scales model capacity and complexity by incorporating multiple sub-network experts, each potentially addressing specific domains or tasks. Within an MoE layer, $N$ independent experts are coordinated by a gating network (router), which applies a trainable matrix $W_g$ to generate a probability distribution for weighting the outputs of these experts, employing a softmax function for normalization. For a given input vector $x$, the output probability distribution of the router can be expressed as:
$$G(x) = \mathrm{Softmax}(W_g x) \quad (2)$$
where $W_g$ represents the trainable weights of the gating network router. The final output from the MoE layer is a weighted sum of the outputs from the top $K$ experts:
$$y = \sum_{i=1}^{N} \mathrm{TopK}(G(x))_i \, E_i(x) \quad (3)$$
where $\mathrm{TopK}(G(x))_i$ and $E_i(x)$ are the weight and output of the $i$-th expert in the MoE layer, respectively. The TopK function identifies and retains the $K$ highest weights, setting the rest to zero. The weights retained by the TopK function are normalized to ensure their sum equals one.
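As a reference point for the dynamic thresholds introduced later, the following sketch shows the standard softmax-plus-Top-K gating of Eqs. (2)–(3) under the conventions above; `topk_gate` and its arguments are hypothetical names.

```python
import torch


def topk_gate(x: torch.Tensor, w_g: torch.Tensor, k: int = 2) -> torch.Tensor:
    """Standard MoE gating: softmax routing, keep the top-k weights, renormalize."""
    probs = torch.softmax(x @ w_g, dim=-1)            # G(x) in Eq. (2)
    topk_vals, topk_idx = probs.topk(k, dim=-1)       # retain the k largest weights
    gates = torch.zeros_like(probs).scatter_(-1, topk_idx, topk_vals)
    return gates / gates.sum(dim=-1, keepdim=True)    # retained weights sum to one


# Usage sketch: y = sum over the k active experts i of gates[:, i] * E_i(x).
```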
II-B Hierarchical Routing
The original intention of MoE is that the routers determine weights for each expert according to the input sample, assigning higher weights to experts focused on the input domain and thereby establishing a clear correspondence between experts and domains. However, routers in MoE layers tend to converge to a state where they always produce large weights for experts that performed well in the early training stage, leaving only a handful of experts with a significant impact [10]. In other words, the experts assigned larger weights by the MoE router remain essentially the same for inputs from different domains. Consequently, this imbalanced expert utilization causes ambiguity in the correspondence between experts and domains. To address this, we propose a hierarchical routing strategy to clarify the correspondence between experts and domains. Specifically, hierarchical routing comprises global and local routing, with a pre-trained accent recognition (AR) model serving as the global router and the individual MoE layer routers serving as local routers. The global routing explicitly guides each expert to focus on a specific accent domain, clarifying the correspondence between experts and accent domains and establishing them as domain-specific experts. Meanwhile, the local routing implicitly guides the experts within each MoE layer to collaborate across multiple accent domains through a learnable gating network router. The input speech features $x$ generate the global weights $\omega^{g}$ through the global router $R_{g}$, which can be expressed as follows:
$$\omega^{g} = \mathrm{Softmax}(R_{g}(x)) \quad (4)$$
where the global router $R_{g}$ is frozen during both training and inference. Subsequently, the global weights are fed into each MoE layer. In each MoE layer, the hidden speech features $h$ produce the local weights $\omega^{l}$ via the local router, which can be expressed as follows:
$$\omega^{l} = \mathrm{Softmax}(W_{l}\, h) \quad (5)$$
where the local router $W_{l}$ is a trainable linear layer.
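A minimal sketch of how the global and local routers of Eqs. (4)–(5) could be wired together, assuming the frozen AR model returns one logit per accent domain / LoRA expert; the module and variable names are ours, not the paper's.

```python
import torch
import torch.nn as nn


class HierarchicalRouter(nn.Module):
    """Global weights from a frozen AR model, local weights from a per-layer linear router."""

    def __init__(self, ar_model: nn.Module, hidden_dim: int, num_experts: int):
        super().__init__()
        self.ar_model = ar_model                       # pre-trained global router
        for p in self.ar_model.parameters():
            p.requires_grad_(False)                    # frozen during training and inference
        self.local_router = nn.Linear(hidden_dim, num_experts, bias=False)  # trainable

    def forward(self, speech_feats: torch.Tensor, hidden_feats: torch.Tensor):
        with torch.no_grad():
            global_w = torch.softmax(self.ar_model(speech_feats), dim=-1)   # Eq. (4)
        local_w = torch.softmax(self.local_router(hidden_feats), dim=-1)    # Eq. (5)
        return global_w, local_w
```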
II-C Dynamic Thresholds
The standard MoE layer employs the static Top-K expert selection strategy, choosing the $K$ experts with the highest weights determined by the router. Given that each MoE layer concentrates on distinct aspects of the domains, different MoE layers require varying numbers of participating experts. Therefore, we propose a dynamic threshold expert selection strategy that replaces the static Top-K approach. The dynamic threshold strategy allows each MoE layer to flexibly select the experts that need to be activated: experts whose weights exceed the threshold are selected. Careful initialization of the threshold is essential, since a threshold initialized too high may cause all expert weights to fall below it, resulting in no experts being selected. Here, the threshold is initialized at $1/N$, where $N$ is the number of experts, to ensure that at least one expert is selected. In each HDMoLE layer, we assign two independent dynamic thresholds to the global and local weights, respectively; the global and local thresholds are equal in number. Through the global threshold $\tau^{g}$, we obtain the global adapted weights $\tilde{\omega}^{g}$, which can be expressed as follows:
$$\tilde{\omega}^{g}_{i} = \frac{\mathbb{1}(\omega^{g}_{i} \geq \tau^{g}) \cdot \omega^{g}_{i}}{\tau^{g}} \quad (6)$$
where $\mathbb{1}(\text{condition})$ equals one if the condition is true and zero otherwise, and $\omega^{g}_{i}$ represents the weight of the $i$-th expert in the global weights. Additionally, scaling the adapted weights by $\frac{1}{\tau^{g}}$ guarantees that $\tau^{g}$ remains learnable during backpropagation. Similarly, the local threshold $\tau^{l}$ allows us to obtain the local adapted weights $\tilde{\omega}^{l}$, which can be expressed as follows:
$$\tilde{\omega}^{l}_{i} = \frac{\mathbb{1}(\omega^{l}_{i} \geq \tau^{l}) \cdot \omega^{l}_{i}}{\tau^{l}} \quad (7)$$
where $\omega^{l}_{i}$ represents the weight of the $i$-th expert in the local weights. The final adapted weights $\tilde{\omega}$ are the sum of the global and local adapted weights, which can be defined as follows:
$$\tilde{\omega}_{i} = \tilde{\omega}^{g}_{i} + \tilde{\omega}^{l}_{i} \quad (8)$$
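The following sketch illustrates one way to realize Eqs. (6)–(8) with a learnable scalar threshold per weight set, where dividing the masked weights by the threshold keeps it differentiable; this is an interpretation of the description above, not the authors' code.

```python
import torch
import torch.nn as nn


class DynamicThreshold(nn.Module):
    """Keep experts whose routing weights exceed a learnable threshold (Eqs. 6-7)."""

    def __init__(self, num_experts: int):
        super().__init__()
        # Initialized at 1/N so that at least one softmax weight always passes.
        self.tau = nn.Parameter(torch.tensor(1.0 / num_experts))

    def forward(self, weights: torch.Tensor) -> torch.Tensor:
        mask = (weights >= self.tau).float()   # indicator(condition)
        return mask * weights / self.tau       # dividing by tau keeps it learnable


# Usage sketch for Eq. (8): independent thresholds for the global and local weights.
# global_gate, local_gate = DynamicThreshold(8), DynamicThreshold(8)
# adapted_w = global_gate(global_w) + local_gate(local_w)
```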
II-D Mixture of LoRA Experts
MoLE substitutes conventional dense-layer experts with LoRA experts, rendering it a parameter-efficient fine-tuning approach. The final output of MoLE is a combination of the weighted outputs of the LoRA experts and the output of the original model. After applying hierarchical routing and dynamic thresholds, the final adapted weights $\tilde{\omega}$ are obtained. Therefore, the output of the MoLE layer can be expressed as follows:
$$y = Wx + b + \sum_{i=1}^{N} \tilde{\omega}_{i} \cdot \frac{\alpha}{r} B_{i} A_{i} x \quad (9)$$
Notably, when the weight of a certain LoRA expert is simultaneously lower than the global and local thresholds, the LoRA expert’s weight in the final adapted weights becomes zero.
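Combining the pieces, a condensed sketch of the MoLE forward pass in Eq. (9): the frozen base linear layer plus a sum of LoRA expert updates weighted by the final adapted weights, where an adapted weight of zero removes that expert's contribution. All names are illustrative.

```python
import torch
import torch.nn as nn


class MoLELayer(nn.Module):
    """Frozen linear layer plus a weighted sum of LoRA expert updates (Eq. 9)."""

    def __init__(self, base: nn.Linear, num_experts: int = 8, rank: int = 8, alpha: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                    # original W and b stay frozen
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(num_experts, rank, k) * 0.01)
        self.B = nn.Parameter(torch.zeros(num_experts, d, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor, adapted_w: torch.Tensor) -> torch.Tensor:
        # x: (batch, k); adapted_w: (batch, num_experts) from Eq. (8), zeros drop experts.
        down = torch.einsum("erk,bk->ber", self.A, x)          # per-expert down-projection
        up = torch.einsum("edr,ber->bed", self.B, down)        # per-expert up-projection
        return self.base(x) + self.scaling * torch.einsum("be,bed->bd", adapted_w, up)
```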
III Experiments
Table I: CER (%) results on the KeSpeech and AISHELL-2 test datasets for different fine-tuning methods.

| Method | Fine-tuned Modules | Trainable Param. | KeSpeech CER | AISHELL-2 CER |
|---|---|---|---|---|
| Hubert+Baichuan2 [7] | No fine-tuning | - | 25.65 | 3.50 |
| Hubert+Baichuan2 [7] | Full Projector | 51M | 15.64 | 4.91 |
| LoRA [17] | LoRA Expert | 0.31M | 19.98 | 4.05 |
| MOELoRA [23] | LoRA Experts & Local Routers | 4.92M | 19.95 | 4.66 |
| LoRAMoE [11] | LoRA Experts & Local Routers | 1.99M | 18.76 | 4.33 |
| MoRAL [27] | LoRA Experts & Local Routers | 1.99M | 19.28 | 4.35 |
| MoA [24] | LoRA Experts & Local Routers | 4.92M | 18.80 | 4.47 |
| HDMoLE (Ours) | LoRA Experts & Local Routers & Thresholds | 4.92M | 16.58 | 3.69 |
| w/o Dynamic Thresholds | LoRA Experts & Local Routers | 4.92M | 17.39 | 3.82 |
| w/o Local Routing | LoRA Experts | 4.37M | 18.33 | 3.86 |
| w/o Global Routing | LoRA Experts & Local Routers | 4.92M | 18.76 | 3.94 |
III-A Experimental Setup
Datasets. We conduct experiments on the multi-accent Mandarin KeSpeech [30] and the standard Mandarin AISHELL-2 [31] datasets to evaluate the effectiveness of HDMoLE. KeSpeech contains 1,542 hours of speech recorded by 27,237 speakers from 34 cities in China, covering standard Mandarin and eight major Mandarin accents. All of our HDMoLE experiments are trained solely on the KeSpeech dataset. We use the KeSpeech test set to evaluate the performance of HDMoLE in the target multi-accent domains, while the AISHELL-2 test set measures performance degradation in the source general domain.
Settings. HDMoLE is implemented in the projector module of a pre-trained LLM-based ASR model with the Hubert+Baichuan2 structure [7]. This model is trained on over 11,000 hours of standard Mandarin data from four general-domain corpora, namely WenetSpeech [32], AISHELL-1 [33], AISHELL-2 [31], and AISHELL-4 [34], and excludes multi-accent Mandarin corpora. The projector module is a 4-layer Transformer [35] with 51M parameters, where the feed-forward networks (FFN) have 2560 dimensions, the multi-head self-attention (MHSA) has 256 dimensions, and the number of attention heads is 4. Each HDMoLE layer has 8 LoRA experts, each with a rank of 8 and an alpha value of 8. We apply HDMoLE to the FFN layers and the four weight matrices of the MHSA in the projector module. The global router is a frozen pre-trained AR model, a 12-layer Conformer [36] with 31M parameters trained solely on KeSpeech, with 4 attention heads and FFN and MHSA dimensions of 2048 and 256, respectively; it achieves an AR accuracy of 81.44% on the KeSpeech test set.
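For readability, the reported hyperparameters can be summarized as a configuration sketch; the field names (and the q/k/v/o labels for the four MHSA weight matrices) are our own shorthand, with values taken from the text above.

```python
# Hedged summary of the reported HDMoLE setup; keys are illustrative shorthand.
hdmole_config = {
    # Projector: 4-layer Transformer, 51M parameters
    "projector_layers": 4,
    "ffn_dim": 2560,
    "mhsa_dim": 256,
    "attention_heads": 4,
    # HDMoLE applied to FFN layers and the four MHSA weight matrices
    "target_modules": ["ffn", "q_proj", "k_proj", "v_proj", "o_proj"],
    "num_lora_experts": 8,
    "lora_rank": 8,
    "lora_alpha": 8,
    # Global router: frozen 12-layer Conformer AR model, 31M params, 81.44% AR accuracy
    "global_router": {"layers": 12, "ffn_dim": 2048, "mhsa_dim": 256, "heads": 4},
}
```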
Table II: KeSpeech CER (%) of HDMoLE with global routers of different AR accuracies (%).

| AR Accuracy | 62.97 | 81.44 | 100 |
|---|---|---|---|
| KeSpeech CER | 18.11 | 16.58 | 15.35 |
Table III: Trainable parameters and KeSpeech CER (%) of HDMoLE with different LoRA ranks.

| LoRA Rank | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|
| Trainable Param. | 2.73M | 4.92M | 9.29M | 18.05M | 35.55M |
| KeSpeech CER | 16.94 | 16.58 | 16.07 | 15.92 | 15.98 |
III-B Results and Discussion
Main Results. Table I presents the CER results of various methods on the KeSpeech and AISHELL-2 test datasets. The first line shows the inference results of the pre-trained Hubert+Baichuan2 model; its poor performance on KeSpeech is attributed to the lack of accented speech corpora during pre-training. The second line shows the results of fully fine-tuning the projector module of the Hubert+Baichuan2 model. The significant improvement on the KeSpeech test dataset serves as the target-domain topline for HDMoLE, while the regression on AISHELL-2 indicates that full fine-tuning degrades the model's source-domain performance. Lines 3 to 7 present the results of the original LoRA and various MoLE methods on KeSpeech and AISHELL-2, none of which match the performance of HDMoLE in either the target or the source domain. Lines 8 to 11 present the ablation study of HDMoLE, demonstrating the necessity of the two proposed strategies. In summary, HDMoLE fine-tunes the projector module of the Hubert+Baichuan2 model with far fewer trainable parameters, achieving performance close to full fine-tuning in the target domain while minimizing regression in the source domain.
Various Global Routers. Table II illustrates the influence of global routers of varying quality on the efficacy of HDMoLE. Our experiments involve three kinds of global routers: a non-convergent AR model, a convergent AR model, and ground-truth accent labels (equivalent to 100% AR accuracy). We observe that improvements in the global router's accuracy lead to better HDMoLE performance. This indicates that a stronger global router clarifies the correspondence between LoRA experts and accent domains, enabling each expert to concentrate more effectively on its specific domain.

Rank of LoRA Experts. Table III examines the impact of LoRA rank on the performance of HDMoLE. As the LoRA rank grows, the parameter count of each LoRA expert increases, raising the number of trainable parameters in HDMoLE and leading to noticeable performance improvements. However, when the LoRA rank increases beyond 16, the performance gains of HDMoLE diminish and may even regress.
III-C Visualization of Dynamic Thresholds
Figure 2 shows the average number of activated LoRA experts for every projector layer. HDMoLE in each projector layer concentrates on distinct aspects, requiring a dynamic activation of LoRA experts based on the layer’s focus. HDMoLE in lower projector layers tends to activate more experts, signifying that these layers are essential for processing general speech features, which are more complex and diverse, thus requiring comprehensive domain knowledge. Conversely, HDMoLE in higher projector layers tends to activate fewer experts, indicating that these layers focus on processing domain-specific speech features, demanding more specialized domain knowledge.
IV Conclusions
This paper proposes HDMoLE, a novel parameter-efficient fine-tuning method for adapting pre-trained LLM-based ASR models to target multi-accent domains without catastrophic forgetting, which employs hierarchical routing and dynamic thresholds based on MoLE and can be applied to any linear layer. Hierarchical routing clarifies the correspondence between LoRA experts and accent domains and improves cross-domain collaboration among LoRA experts, while dynamic thresholds adaptively determine the number of activated LoRA experts at each MoLE layer. Extensive experiments demonstrate that HDMoLE achieves results comparable to full fine-tuning in the target multi-accent domains using only 9.6% of the trainable parameters required for full fine-tuning, while minimizing regression in the source general domain. In the future, we will explore applying HDMoLE across all LLM-based ASR models.
References
- [1] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe, “Training language models to follow instructions with human feedback,” in Proc. NeurIPS, 2022.
- [2] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “GPT-4 Technical Report,” arXiv preprint arXiv:2303.08774, 2023.
- [3] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama 2: Open Foundation and Fine-Tuned Chat Models,” arXiv preprint arXiv:2307.09288, 2023.
- [4] A. Yang, B. Yang, B. Hui, B. Zheng, B. Yu, C. Zhou, C. Li, C. Li, D. Liu, F. Huang et al., “Qwen2 Technical Report,” arXiv preprint arXiv:2407.10671, 2024.
- [5] Y. Chu, J. Xu, Q. Yang, H. Wei, X. Wei, Z. Guo, Y. Leng, Y. Lv, J. He, J. Lin et al., “Qwen2-Audio Technical Report,” arXiv preprint arXiv:2407.10759, 2024.
- [6] Y. Bai, J. Chen, J. Chen, W. Chen, Z. Chen, C. Ding, L. Dong, Q. Dong, Y. Du, K. Gao et al., “Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition,” arXiv preprint arXiv:2407.04675, 2024.
- [7] X. Geng, T. Xu, K. Wei, B. Mu, H. Xue, H. Wang, Y. Li, P. Guo, Y. Dai, L. Li, M. Shao, and L. Xie, “Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets,” arXiv preprint arXiv:2405.02132, 2024.
- [8] B. Mu, P. Guo, D. Guo, P. Zhou, W. Chen, and L. Xie, “Automatic Channel Selection and Spatial Feature Integration for Multi-Channel Speech Recognition Across Various Array Topologies,” in Proc. ICASSP, 2024, pp. 11396–11400.
- [9] B. Mu, X. Wan, N. Zheng, H. Zhou, and L. Xie, “MMGER: Multi-Modal and Multi-Granularity Generative Error Correction With LLM for Joint Accent and Speech Recognition,” IEEE Signal Processing Letters, vol. 31, pp. 1940–1944, 2024.
- [10] X. Wu, S. Huang, and F. Wei, “Mixture of LoRA Experts,” in Proc. ICLR, 2024.
- [11] S. Dou, E. Zhou, Y. Liu, S. Gao, W. Shen, L. Xiong, Y. Zhou, X. Wang, Z. Xi, X. Fan, S. Pu, J. Zhu, R. Zheng, T. Gui, Q. Zhang, and X. Huang, “LoRAMoE: Alleviating World Knowledge Forgetting in Large Language Models via MoE-Style Plugin,” in Proc. ACL, 2024, pp. 1932–1945.
- [12] H. Liu, D. Tam, M. Muqeeth, J. Mohta, T. Huang, M. Bansal, and C. Raffel, “Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning,” in Proc. NeurIPS, 2022.
- [13] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. de Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly, “Parameter-Efficient Transfer Learning for NLP,” in Proc. ICML, 2019, pp. 2790–2799.
- [14] X. L. Li and P. Liang, “Prefix-Tuning: Optimizing Continuous Prompts for Generation,” in Proc. ACL/IJCNLP, 2021, pp. 4582–4597.
- [15] B. Lester, R. Al-Rfou, and N. Constant, “The Power of Scale for Parameter-Efficient Prompt Tuning,” in Proc. EMNLP, 2021, pp. 3045–3059.
- [16] E. B. Zaken, Y. Goldberg, and S. Ravfogel, “BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models,” in Proc. ACL, 2022, pp. 1–9.
- [17] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-Rank Adaptation of Large Language Models,” in Proc. ICLR, 2022.
- [18] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. V. Le, G. E. Hinton, and J. Dean, “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer,” in Proc. ICLR, 2017.
- [19] W. Fedus, J. Dean, and B. Zoph, “A Review of Sparse Expert Models in Deep Learning,” arXiv preprint arXiv:2209.01667, 2022.
- [20] B. Zoph, I. Bello, S. Kumar, N. Du, Y. Huang, J. Dean, N. Shazeer, and W. Fedus, “ST-MoE: Designing Stable and Transferable Sparse Expert Models,” arXiv preprint arXiv:2202.08906, 2022.
- [21] Y. Xie, S. Huang, T. Chen, and F. Wei, “MoEC: Mixture of Expert Clusters,” in Proc. AAAI, 2023, pp. 13807–13815.
- [22] X. Song, D. Wu, B. Zhang, D. Zhou, Z. Peng, B. Dang, F. Pan, and C. Yang, “U2++ MoE: Scaling 4.7x parameters with minimal impact on RTF,” arXiv preprint arXiv:2404.16407, 2024.
- [23] Q. Liu, X. Wu, X. Zhao, Y. Zhu, D. Xu, F. Tian, and Y. Zheng, “MOELoRA: An MOE-based Parameter Efficient Fine-Tuning Method for Multi-task Medical Applications,” arXiv preprint arXiv:2310.18339, 2023.
- [24] W. Feng, C. Hao, Y. Zhang, Y. Han, and H. Wang, “Mixture-of-LoRAs: An Efficient Multitask Tuning Method for Large Language Models,” in Proc. LREC-COLING, 2024, pp. 11371–11380.
- [25] Y. Zhu, N. Wichers, C.-C. Lin, X. Wang, T. Chen, L. Shu, H. Lu, C. Liu, L. Luo, J. Chen et al., “SiRA: Sparse Mixture of Low Rank Adaptation,” arXiv preprint arXiv:2311.09179, 2023.
- [26] D. Li, Y. Ma, N. Wang, Z. Cheng, L. Duan, J. Zuo, C. Yang, and M. Tang, “MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA based Mixture of Experts,” arXiv preprint arXiv:2404.15159, 2024.
- [27] S. Yang, M. A. Ali, C.-L. Wang, L. Hu, and D. Wang, “MoRAL: MoE Augmented LoRA for LLMs’ Lifelong Learning,” arXiv preprint arXiv:2402.11260, 2024.
- [28] Z. Liu and J. Luo, “AdaMoLE: Fine-Tuning Large Language Models with Adaptive Mixture of Low-Rank Adaptation Experts,” arXiv preprint arXiv:2405.00361, 2024.
- [29] N. Markl and C. Lai, “Everyone has an accent,” in Proc. Interspeech, 2023, pp. 4424–4427.
- [30] Z. Tang, D. Wang, Y. Xu, J. Sun, X. Lei, S. Zhao, C. Wen, X. Tan et al., “KeSpeech: An Open Source Speech Dataset of Mandarin and Its Eight Subdialects,” in Proc. NeurIPS Datasets and Benchmarks Track, 2021.
- [31] J. Du, X. Na, X. Liu, and H. Bu, “AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale,” arXiv preprint arXiv:1808.10583, 2018.
- [32] B. Zhang, H. Lv, P. Guo, Q. Shao, C. Yang, L. Xie, X. Xu, H. Bu, X. Chen, C. Zeng et al., “WenetSpeech: A 10000+ Hours Multi-Domain Mandarin Corpus for Speech Recognition,” in Proc. ICASSP, 2022, pp. 6182–6186.
- [33] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, “AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline,” in Proc. O-COCOSDA, 2017, pp. 1–5.
- [34] Y. Fu, L. Cheng, S. Lv, Y. Jv, Y. Kong, Z. Chen, Y. Hu, L. Xie, J. Wu, H. Bu, X. Xu, J. Du, and J. Chen, “AISHELL-4: an open source dataset for speech enhancement, separation, recognition and speaker diarization in conference scenario,” in Proc. Interspeech, 2021, pp. 3665–3669.
- [35] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is All you Need,” in Proc. NeurIPS, 2017, pp. 5998–6008.
- [36] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang, “Conformer: Convolution-augmented Transformer for Speech Recognition,” in Proc. Interspeech, 2020, pp. 5036–5040.