
TODM: Train Once Deploy Many
Efficient Supernet-Based RNN-T Compression for on-device ASR models

Abstract

Automatic Speech Recognition (ASR) models need to be optimized for specific hardware before they can be deployed on devices. This can be done by tuning the model’s hyperparameters or exploring variations in its architecture. Re-training and re-validating models after making these changes can be a resource-intensive task. This paper presents TODM (Train Once Deploy Many), a new approach to efficiently train many sizes of hardware-friendly on-device ASR models with GPU-hours comparable to those of a single training job. TODM leverages insights from prior work on Supernets, in which Recurrent Neural Network Transducer (RNN-T) models share weights within a Supernet. It reduces the layer counts and widths of the Supernet to obtain subnetworks, yielding smaller models suitable for all hardware types. We introduce a novel combination of three techniques to improve the outcomes of the TODM Supernet: adaptive dropout, an in-place Alpha-divergence knowledge distillation, and the ScaledAdam optimizer. We validate our approach by comparing Supernet-trained versus individually tuned Multi-Head State Space Model (MH-SSM) RNN-Ts on LibriSpeech. Results demonstrate that our TODM Supernet either matches or surpasses the performance of manually tuned models, by up to 3% relative in word error rate (WER), while keeping the cost of training many models at a small constant.

Index Terms—  Supernet, RNN-T, on-device, efficiency, compression, knowledge distillation

1 Introduction

End-to-end (E2E) all-neural automatic speech recognition (ASR) has gained attention for its compatibility with edge devices [1, 2, 3]. Recurrent neural network transducer (RNN-T) is one of the most popular architectures for on-device E2E ASR [4, 5, 6, 7]. Modern applications of RNN-T ASR run on a variety of hardware, including central processing units (CPUs) on phones, tensor processing units (TPUs) [8], and other neural accelerators. The process of optimizing on-device ASR models efficiently for each distinct hardware configuration necessitates fine-tuning model training hyperparameters and streamlining the model’s architecture [9]. This demands a substantial amount of computational resources, primarily in the form of GPU-hours or GPU cluster energy expenses in MegaWatt-hour (MWh). Researchers must balance accuracy and size while efficiently managing training resources.

A potential approach is to leverage a Supernet [10, 11, 12, 13, 14]. A Supernet is a weight-sharing neural network graph that can generate smaller subnetworks tailored for specific applications. Prior work has attempted to facilitate the simultaneous training of a fixed number of models by implementing weight-sharing capabilities; examples include work on RNN-T cascaded encoders [8, 15]. These efforts, however, typically involve a limited number of networks, often around three, that share only portions of the RNN-T encoders. Their primary objective is to streamline management for a few models, while utilizing distinct model decoders to improve model performance. Some also combine non-causal and causal encoders to enhance model accuracy. The training of such cascaded encoder systems consumes considerably more resources than a single model-training task, and it cannot scale to a large number of model sizes. Besides cascaded encoders, our previous work explored the concept of a Supernet in the Omni-Sparsity DNN framework [16], in which a large number of ASR subnetworks can be derived from one Supernet training job. Furthermore, there have been efforts to extend Omni-Sparsity DNN to train ASR models while accommodating various latency constraints for the subnetworks; this was demonstrated by training a non-streaming dense Supernet with sparse streaming subnetworks in [17]. Omni-Sparsity DNNs’ reliance on structured sparsity, however, unnecessarily limits their hardware compatibility.

This paper presents Train Once Deploy Many (TODM), an approach to efficiently train many hardware-friendly on-device ASR models with GPU-hours comparable to a single training job. TODM leverages insights from prior work on the Omni-sparsity DNN Supernet, but does not rely on sparsity. Instead, it drops layers and reduces layer widths in the Supernet to obtain optimized subnetworks for all hardware types. We explain the TODM Supernet in Section 2. We improve its training outcome by introducing adaptive dropout, an in-place logit-sampled Alpha-divergence knowledge distillation mechanism, and the ScaledAdam optimizer (Section 3). We validate our approach by generating Supernet-trained sub-models using evolutionary search on the validation set (described in Section 2.2) and comparing their performance to individually optimized ASR models on LibriSpeech [18] in Section 4. We discuss the results and why the Supernet can function efficiently in Section 5. This work is the first to use Supernet training to efficiently produce a variety of on-device ASR models. With training resources similar to those of one model, we can deploy many models of different sizes, each with optimal quality at its desired size.

2 Efficient Supernet RNN-T Compression

In this section, we describe how a Supernet RNN-T generates subnetwork RNN-Ts of dynamic sizes by varying the width and depth of the encoder. It then uses evolutionary search [16] to discover the best RNN-T encoder architecture under different size constraints.

2.1 Constrained Layer and Width Reduction

We experiment with RNN-T-based ASR models [5]. Supernet training is orthogonal to the type of model we select, and could trivially extend to CTC-based ASR encoders or E2E sequence-to-sequence models. Our design of the subnetwork search space is pivotal to the success of Supernet training. Through empirical exploration, we outline three guiding principles for designing the architecture search space for the best Supernet outcomes; the search space should be:

  1. large enough to find good models of any desired size;

  2. flexible enough to discover new architectures;

  3. architecture-aware, focusing on major parameter redundancy in the model.

Fig. 1: (a) MH-SSM layer: during Supernet training, subnetworks are created by reducing the output channel sizes of FFN modules (dotted lines) or by dropping the entire layer. (b) Encoder channel-and-layer reduction of a 49.6 MB Pareto subnetwork for Model F. Blue: remaining FFN channels; gray: reduced layers and channels.

In RNN-T, the encoder typically occupies the bulk of the parameters. For example, in our baseline model, the encoder accounts for 86.9% of the parameters, followed by the predictor (8.9%) and the joiner (4.2%). Therefore, we use the Supernet only to find subnetworks of the encoder, while keeping the predictor and joiner unchanged.

Let $\Theta$ be the Supernet model parameters, and let $L_{n} \in \mathbb{Z}_{L}^{n}$ denote the list of layers in the RNN-T encoder; the Supernet RNN-T encoder has a maximum of $n$ layers. We define the width of module $i$, which typically refers to the output channel size of feed-forward networks or linear projection modules, as $C_{i} \in \mathbb{Z}_{C}^{m_{i}}$, where this layer has a maximum output size of $m_{i}$. A subnetwork can be generated by selecting a subset of layers and a subset of channels from the search space $\bar{\mathbb{Z}}$.

\bar{\mathbb{Z}} \coloneqq \bigcup_{i=0}^{I} \mathbb{Z}_{C}^{m_{i}} \cup \mathbb{Z}_{L}^{n}    (1)

where $L$ operates at the layer level, and $C$ operates at the sub-layer module level. To make the search space more manageable, we further constrain the layer and width trimming to a monotonic relationship: if layer $l_{i}$ is trimmed, then all subsequent layers $l_{i+1}, \ldots, l_{n}$ are also removed; similarly, for channels $c_{0}, \ldots, c_{m_{i}}$, if channel $c_{i}$ is trimmed, all subsequent output channels are also removed.
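To make the monotonic trimming constraint concrete, the sketch below samples a subnetwork configuration by choosing a kept-layer count and a per-layer FFN width; trimming is always a contiguous suffix of layers and channels. The helper names and the specific layer/width choices (matching the setup described in Section 4) are assumptions for illustration, not the paper's actual implementation.

```python
import random

# Hypothetical illustration of the constrained layer/width search space (Eq. 1).
# Assumptions: a 16-layer encoder, suffix-only (monotonic) trimming, and FFN
# widths drawn from a small discrete set.
LAYER_CHOICES = [9, 13, 16]                  # keep the bottom k layers; drop the rest
WIDTH_CHOICES = [512, 1024, 2048, 4096]      # allowed FFN output channel sizes

def sample_subnetwork(rng: random.Random) -> dict:
    """Sample one subnetwork: a kept-layer count plus an FFN width per kept layer."""
    num_layers = rng.choice(LAYER_CHOICES)
    # Choosing width w keeps output channels 0..w-1 (monotonic channel trimming).
    widths = [rng.choice(WIDTH_CHOICES) for _ in range(num_layers)]
    return {"num_layers": num_layers, "ffn_widths": widths}

def min_subnetwork() -> dict:
    k = min(LAYER_CHOICES)
    return {"num_layers": k, "ffn_widths": [min(WIDTH_CHOICES)] * k}

def max_subnetwork() -> dict:
    k = max(LAYER_CHOICES)
    return {"num_layers": k, "ffn_widths": [max(WIDTH_CHOICES)] * k}

if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_subnetwork(rng))   # e.g. {'num_layers': 13, 'ffn_widths': [...]}
```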

During training, we perform sandwich sampling [16, 13]. We sample four subnetwork sizes: the maximum (i.e., the entire Supernet), the minimum, and two random sizes (i.e., random selections of $C_{i}$s and $L_{i}$s from $\bar{\mathbb{Z}}$). Unlike prior work, we feed the entire batch of input to the maximum subnetwork during the forward step, while the other three subnetworks each receive a mini-batch that is 1/4 of the batch. This keeps the quality of the maximum network high while keeping the training GPU-hours reasonable. We optimize the following loss function:

\min_{\Theta} \; \mathbb{E}_{s \sim \bar{\mathbb{Z}}} \left[ \mathbb{E}_{(\mathbf{x},\mathbf{y}^{\ast}) \sim D_{train}} \, \mathcal{L}_{rnnt}(\mathbf{y}^{\ast} \mid \mathbf{x}; \Theta_{s}) \right]    (2)

where $s$ is a subnetwork sampled from the search space of the Supernet, and $\mathcal{L}_{rnnt}(\mathbf{y}^{\ast} \mid \mathbf{x}; \Theta_{s})$ is the transducer loss [4] with respect to the correct output sequence $\mathbf{y}^{\ast}$ given input features $\mathbf{x}$ from the training data $D_{train}$, computed using subnetwork $\Theta_{s}$.
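A minimal sketch of one sandwich training step (Eq. 2) under stated assumptions: it reuses the `sample_subnetwork`, `min_subnetwork`, and `max_subnetwork` helpers from the previous sketch, and assumes a PyTorch-style RNN-T module whose forward pass accepts a hypothetical `subnet_config` argument, plus assumed `rnnt_loss` and `slice_batch` helpers. This is not the paper's actual training code.

```python
def sandwich_step(model, batch, optimizer, rng, rnnt_loss, slice_batch):
    """One Supernet update: max + min + 2 random subnetworks share the same weights."""
    optimizer.zero_grad()

    # 1) The max subnetwork (the entire Supernet) consumes the full batch.
    logits = model(batch["feats"], batch["targets"], subnet_config=max_subnetwork())
    loss = rnnt_loss(logits, batch["targets"], batch["feat_lens"], batch["target_lens"])
    loss.backward()

    # 2) The min and two random subnetworks each consume a 1/4 mini-batch.
    quarter = slice_batch(batch, fraction=0.25)   # hypothetical helper
    for config in (min_subnetwork(), sample_subnetwork(rng), sample_subnetwork(rng)):
        sub_logits = model(quarter["feats"], quarter["targets"], subnet_config=config)
        sub_loss = rnnt_loss(sub_logits, quarter["targets"],
                             quarter["feat_lens"], quarter["target_lens"])
        sub_loss.backward()                       # gradients accumulate in the shared weights

    optimizer.step()
    return loss.item()
```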

2.2 Post-training Supernet Pareto Search

We use evolutionary search to discover top-performing subnetworks from the search space of subnetwork options, using the validation-set WER as the fitness score [19]. The Supernet evolutionary search can dynamically quantize subnetworks and evaluate them on CPUs, and thus consumes roughly $100\times$ fewer resources than model training. The outcome is a collection of subnetwork configurations, each optimizing accuracy under a size constraint $\bar{\tau} = \left[\tau_{1}, \tau_{2}, \ldots, \tau_{t}\right]$:

\left\{ \arg\min_{s_{i} \sim \bar{\mathbb{Z}}} \; \mathbb{E}_{(\mathbf{x},\mathbf{y}^{\ast}) \sim D_{val}} \left[ \text{WER}(\mathbf{y}^{\ast}, \mathbf{x}, \Theta_{s_{i}}) \right], \;\; \text{s.t. } \mathbb{M}(s_{i}) \leq \tau_{i} \right\}    (3)

where $\mathbb{M}(s_{i})$ is the model size of subnetwork $s_{i}$, and the loss here is the word error rate (WER) computed using beam-search decoding with beam size 5.
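The following sketch illustrates the size-constrained evolutionary search of Eq. (3) under simplified assumptions. It reuses `sample_subnetwork`, `LAYER_CHOICES`, and `WIDTH_CHOICES` from the earlier sketch; the population and mutation parameters, and the `eval_wer` and `model_size_mb` callbacks (which would decode the quantized subnetwork on the validation set and measure its 8-bit size), are hypothetical stand-ins for the paper's actual search implementation.

```python
import random

def evolutionary_search(eval_wer, model_size_mb, size_budget_mb,
                        population=50, generations=20, seed=0):
    """Find a subnetwork config minimizing validation WER s.t. its size fits the budget."""
    rng = random.Random(seed)
    pool = []
    while len(pool) < population:                      # rejection-sample an initial population
        cand = sample_subnetwork(rng)
        if model_size_mb(cand) <= size_budget_mb:
            pool.append(cand)

    best, best_wer = None, float("inf")
    for _ in range(generations):
        scored = sorted(pool, key=eval_wer)            # lower WER is fitter
        top_wer = eval_wer(scored[0])
        if top_wer < best_wer:
            best, best_wer = scored[0], top_wer
        parents = scored[: max(2, len(scored) // 4)]   # keep the top quarter as parents
        children = []
        while len(children) < population:
            child = mutate(rng.choice(parents), rng)
            if model_size_mb(child) <= size_budget_mb:
                children.append(child)
        pool = parents + children
    return best

def mutate(config, rng, p=0.2):
    """Randomly re-draw the kept-layer count and/or individual FFN widths."""
    new = {"num_layers": config["num_layers"], "ffn_widths": list(config["ffn_widths"])}
    if rng.random() < p:
        new["num_layers"] = rng.choice(LAYER_CHOICES)
    new["ffn_widths"] = new["ffn_widths"][: new["num_layers"]]
    while len(new["ffn_widths"]) < new["num_layers"]:
        new["ffn_widths"].append(rng.choice(WIDTH_CHOICES))
    for i in range(new["num_layers"]):
        if rng.random() < p:
            new["ffn_widths"][i] = rng.choice(WIDTH_CHOICES)
    return new
```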

3 Improving Supernet Training

In this section, we outline three key strategies for improving the quality of Supernet-trained models. While each technique has been applied in non-Supernet contexts, their combination is novel.

  1. Adaptive Dropout: different subnetworks contribute gradients of different L2-norm magnitudes during Supernet training. Inspired by the adaptive dropout used in Omni-sparsity DNN training [16], we scale the dropout of the FFN module that immediately follows a module whose output channels have been reduced. Intuitively, a module with reduced dimensionality requires less regularization to produce a similar output and gradient norm. Therefore, we set $dropout_{c_{i}} = dropout_{m_{i}} \times \frac{c_{i}}{m_{i}}$ for the module immediately following the reduced output channels of module $i$, where $m_{i}$ is the maximum output channel size and $c_{i}$ is the current channel size.

  2. Sampled In-place Knowledge Distillation (KD): during each step of Supernet training, we use the entire Supernet as the “teacher” model and distill, in place, the max network’s output probability distributions onto the sampled subnetwork (the “student”) using the data in each mini-batch. We experiment with two output-probability divergence functions: (1) the Kullback–Leibler divergence (KLD), which is known to improve the accuracy of a compressed ASR model during knowledge distillation [20, 21]; and (2) the Alpha-divergence (AlphaD) [14], which has been shown to capture the teacher network’s uncertainties in the output probability distributions better than KLD in a Supernet training setting. Distilling the teacher’s output probability distribution over the entire RNN-T lattice is memory-intensive and slows down training by forcing a smaller batch size. To reduce training memory, we subsample the top $j$ output probabilities from the teacher. In a previous study on efficient RNN-T distillation [22], $j$ was set to 2, meaning that the distillation used only 3 logit dimensions: the target token, the blank token, and the cumulative sum of the remaining probabilities. We hypothesize that this insufficiently captures the uncertainty of the teacher’s output probabilities. Therefore, in this work, we compare $j$ values of $[2, 10, 100]$ (with $j+1$ distillation probability dimensions); see the sketch after this list. The loss function becomes:

     \mathcal{L} = \lambda \mathcal{L}_{KD} + \mathcal{L}_{rnnt}    (4)

     \mathcal{L}_{KD}(\Theta_{s};\Theta)\big|_{s \in \bar{\mathbb{Z}}} = \mathbb{E}_{x \in D_{minibatch}}\left[ f\left( p_{j}(x;\Theta) \,\|\, q_{j}(x;\Theta_{s}) \right) \right]    (5)

     where we use the default $\lambda = 1.0$; $f$ is KLD or AlphaD, computed over the sampled probabilities of the top $j$ tokens plus the sum of the rest ($p_{j}$ being the teacher’s distribution, $q_{j}$ the student’s). For Alpha-divergence, we use the default setting of $\alpha_{-} = -1$, $\alpha_{+} = 1$, and $\beta = 5.0$.

  3. ScaledAdam Optimizer: we explore training the Supernet with the ScaledAdam optimizer [23]. A variant of the Adam optimizer, ScaledAdam scales each parameter’s update by the parameter’s norm. Intuitively, ScaledAdam improves the gradient stability of Supernet training, where each subnetwork contributes drastically different gradient norms to the shared model parameters during the mini-batch forward-backward step.
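As a concrete illustration of items 1 and 2 above, the sketch below shows the adaptive dropout scaling rule and a top-j sampled KL-divergence distillation loss over teacher and student logits. The tensor shapes and helper names are assumptions for illustration only; the paper's Alpha-divergence variant would replace the KL term with the divergence of [14].

```python
import torch
import torch.nn.functional as F

def adaptive_dropout(base_dropout: float, current_channels: int, max_channels: int) -> float:
    """Item 1: scale dropout linearly with the kept fraction of output channels."""
    return base_dropout * current_channels / max_channels

def sampled_kd_loss(teacher_logits: torch.Tensor, student_logits: torch.Tensor, j: int = 10):
    """Item 2: KL divergence over the teacher's top-j tokens plus one bucket for the rest.

    Logits have shape (..., vocab); the top-j token ids are chosen from the teacher,
    and the remaining probability mass is summed into a single (j+1)-th dimension.
    """
    t_prob = teacher_logits.softmax(dim=-1)
    s_prob = student_logits.softmax(dim=-1)

    top_p, top_idx = t_prob.topk(j, dim=-1)             # teacher's top-j probabilities
    s_top = s_prob.gather(-1, top_idx)                   # student's mass on the same tokens

    # (j+1)-th bucket: everything outside the teacher's top-j.
    t_rest = (1.0 - top_p.sum(dim=-1, keepdim=True)).clamp_min(1e-8)
    s_rest = (1.0 - s_top.sum(dim=-1, keepdim=True)).clamp_min(1e-8)

    p = torch.cat([top_p, t_rest], dim=-1)               # teacher distribution, j+1 dims
    q = torch.cat([s_top, s_rest], dim=-1)               # student distribution, j+1 dims
    return F.kl_div(q.log(), p, reduction="batchmean")   # KL(p || q)

# Example: an FFN reduced from 4096 to 1024 channels with base dropout 0.1
# gets adaptive_dropout(0.1, 1024, 4096) -> 0.025.
```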

4 Experiment Setup

Fig. 2: Energy (MWh) consumed by training models at scale.

In this paper, we validate the results of the TODM Supernet on a non-streaming Multi-Head State Space Model (MH-SSM) RNN-T. The encoder of our RNN-T is built with 16 MH-SSM layers; these layers are attention-free and have been demonstrated to achieve on-par performance compared to transformer-based ASR [24]. Each MH-SSM layer consists of a 2-stacked, 1-headed MH-SSM and a feed-forward network (FFN) of dimension 4096. Our RNN-T architecture follows the example set in prior work [24]: three 512-dimensional LSTM layers plus a linear layer in the predictor; a linear projection layer and a ReLU gate in the joiner. In total, our non-streaming MH-SSM RNN-T has 99.9 million parameters. During Supernet training, we selectively remove entire MH-SSM layers, or alter the size of the FFN module to $C_{i} \in [512, 1024, 2048, 4096]$. The FFN accounts for 77.8% of the parameters in an MH-SSM layer and has high potential for parameter redundancy. Layer reduction in the MH-SSM Supernet removes the top 0, 3, or 7 MH-SSM layers. See Fig. 1(a) for an illustration. All our baseline MH-SSM RNN-Ts ($A1$, $A2$, $A3$ in Table 1) are trained with the same hyper-parameter setup: 180 epochs of training with a fixed learning rate of 0.006, force-annealed at 60 epochs with shrinking factor 0.96; 0.1 weight decay; Adam optimizer with $\beta_{1} = 0.9$, $\beta_{2} = 0.999$.
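For readability, the configuration below gathers the architecture and training hyper-parameters just described into a single hypothetical dataclass; the field names are ours, not the paper's, and it is a summary sketch rather than an actual configuration file.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class MHSSMRNNTSupernetConfig:
    """Hypothetical summary of the experimental setup in Section 4."""
    # Encoder (the only part searched by the Supernet)
    num_encoder_layers: int = 16          # MH-SSM layers; the top 0, 3, or 7 may be dropped
    ffn_dim_choices: List[int] = field(default_factory=lambda: [512, 1024, 2048, 4096])
    # Predictor and joiner (kept fixed)
    predictor_lstm_layers: int = 3
    predictor_lstm_dim: int = 512
    # Baseline training hyper-parameters
    epochs: int = 180
    learning_rate: float = 6e-3
    anneal_start_epoch: int = 60
    anneal_shrink_factor: float = 0.96
    weight_decay: float = 0.1
    adam_betas: Tuple[float, float] = (0.9, 0.999)
```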

We train the RNN-Ts with the LibriSpeech 960-hour training data [25]. We obtain 80-dimensional log Mel-filterbank features from each 25 ms audio window, sliding the window ahead every 10 ms. We use a pre-trained SentencePiece vocabulary [26] of 4096 tokens, plus a ‘blank’ symbol, as the RNN-T targets. After Supernet training, we use the 10.7-hour LibriSpeech dev-clean and dev-other data to identify the best subnetwork architectures during evolutionary search.
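A minimal sketch of the feature front-end described above, assuming 16 kHz LibriSpeech audio and a torchaudio-based pipeline; the paper only specifies the window, hop, and filterbank sizes, so everything else here (file name, log offset) is an illustrative assumption.

```python
import torch
import torchaudio

# 80-dim log Mel-filterbank features: 25 ms window, 10 ms hop, assuming 16 kHz audio.
SAMPLE_RATE = 16000
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=400,            # 25 ms window at 16 kHz
    win_length=400,
    hop_length=160,       # 10 ms hop at 16 kHz
    n_mels=80,
)

waveform, sr = torchaudio.load("example.flac")    # hypothetical LibriSpeech utterance
assert sr == SAMPLE_RATE
features = torch.log(mel(waveform) + 1e-6)        # shape: (channel, 80, num_frames)
```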

All model training is done with 32 NVIDIA-A100 GPUs; to prevent GPU queue preemption, CPU scheduling wait-times, and other data center down-times from polluting the GPU-hour computation, we report the total energy expenses of each model training in terms of MegaWatt-hour (MWh), measured by the energy consumed and recorded by our GPU cluster.

5 Results and Discussion

We compare our TODM Supernet (Model F) with 4 baselines:

Fig. 3: LibriSpeech test-other WER vs. model size. Our proposed method (Model F) produces subnetworks that are on par with, or exceed, the best individually trained models (especially for models ≥ 50 million parameters). The trend lines are second-degree polynomial approximations of the results.
  1. A1 represents a collection of individually trained and tuned models, whose sizes are determined by reducing layer counts, channel sizes, or both, from the largest MH-SSM model. Because the training energy scales proportionally with the number and sizes of the models, we are able to explore only a limited set of architectures.

  2. A2 represents the results of the largest model, trained with auxiliary cross-entropy loss [27] at layers 8, 12, and 16; auxiliary training introduced 1.15 million additional parameters to the model, but did not improve model performance in Table 1.

  3. A3 represents the results of the largest model, trained with auxiliary RNN-T loss at layers 8, 12, and 16 (i.e., layer-drop); each of these layers is accompanied by its own combination of linear layer, gates, and layer-norm, which is then fed into a shared predictor and joiner [27]. For $k$ subnetworks in the auxiliary RNN-T loss training, training resources increase by $k\times$, and the model grows by an additional $k \times 0.53$ million parameters. Note that A2 and A3 exhibit comparable training energy consumption; we use them to investigate the extra regularization effect of applying loss functions to the intermediate layers of the RNN-T model.

  4. B0 represents a channel-reduction-only Supernet. It is a generalized version of the slimmable network [13], in which subnetworks of different layer-wise widths co-exist in a Supernet.

5.1 On the Energy Consumed during Training

In Fig. 2 we show the energy consumed by training $X \in [3, 6, \ldots, 30]$ models. The energy cost of training each Supernet is constant, while the energy cost of the non-Supernet models (Models A1 and A3) scales linearly with the number of models. The Supernet with KD (Model F) consumes more energy than the Supernet without KD (Model B), because the non-trivial memory consumed by storing gradients and logits for KD forces smaller batch sizes and thus longer training time.

Model | Baseline                      | WER / size (~99 MB) | WER / size (~30 MB) | WER / size (~50 MB)
A1    | 3 individual models           | 6.7 / 99.9          | 8.3 / 31.6          | 7.5 / 49.5
A2    | aux CE at layers 8, 12, 16    | 6.7 / 101.1         | -                   | -
A3    | aux RNN-T at layers 8, 12, 16 | 6.8 / 101.5         | 8.8 / 40.0          | 7.2 / 53.1

Model | Supernet Pareto               | WER / size (~99 MB) | WER / size (~30 MB) | WER / size (~50 MB)
B0    | channel-only reduction        | 7.0 / 99.9          | 8.0 / 41.1          | 7.4 / 60.0
B1    | layer-channel reduction       | 6.7 / 99.9          | 8.9 / 29.1          | 7.6 / 45.9
C     | B1 + adaptive dropout         | 6.8 / 99.9          | 8.7 / 29.1          | 7.5 / 47.0
D     | C + KLD_10                    | 6.7 / 99.9          | 9.1 / 29.1          | 7.7 / 49.1
E     | C + AlphaD_10                 | 6.7 / 99.9          | 9.2 / 29.1          | 7.8 / 49.6
F     | E + ScaledAdam from epoch 120 | 6.5 / 99.9          | 8.7 / 29.1          | 7.4 / 49.6

Table 1: Word error rates (WER) of models on the LibriSpeech test-other data set. We show WERs of the 3 best model sizes from each training job if possible: ~99, ~30, and ~50 MB, obtained along the Pareto front via evolutionary search. Model sizes are determined after 8-bit quantization. Results are also illustrated in Fig. 3.

5.2 On the Design of Supernet Search Space

We lay out three principles for Supernet search-space design in Section 2.1 and demonstrate them using two Supernets in Fig. 3. The square magenta dots (Model B0) show the Pareto results of the channel-reduction Supernet, otherwise known as a slimmable network [13]; the round red dots (Model B1) are the Pareto results of training with layer-and-channel reduction. Layer reduction expands the search space, allowing the Supernet Pareto search to discover more effective medium-sized models, hence resulting in better accuracy vs. model size trade-offs.

5.3 On Probability Sampling for KD

To examine the effectiveness of KD, we evaluate Supernet results at 120 epochs, i.e., 2/3 of the total training time. We record the WERs of the max (i.e., the entire Supernet), the min, and a fixed-architecture (47 MB) model in Table 2. The fixed architecture does not always lie on the Pareto front of each Supernet. Distilling the full 4096-dimensional logits causes GPU out-of-memory failures during training, so “C+KLD4096” is not included in Table 2. We find that KD helps the Supernet converge faster: at 120 epochs, model G (trained with KD) already improves upon baseline model C by a relative 10.3%, 4.2%, and 2.4% WER in its max, min, and fixed-size networks, respectively.

Surprisingly, with KD, the max network converges much faster than the smaller networks, even though KD is designed to improve the representations of the smaller networks. We hypothesize that this is because high-quality subnetworks contribute to the overall max network’s performance via weight sharing.

AlphaD and KLD produce similar results, with AlphaD slightly outperforming KLD by 2-3%. However, the difference is not statistically significant. AlphaD could prevent overestimation and underestimation of the teacher’s uncertainties [14], but sampling logits and reducing the probability distribution may already underestimate the teacher’s uncertainty, limiting the effectiveness of AlphaD.

Model      | KD sampled prob. | WER (99.9 MB) | WER (29.1 MB) | WER (47.0 MB)
C (120 ep) | no KD            | 7.8           | 9.6           | 8.4
G (120 ep) | C + AlphaD_2     | 7.0           | 9.2           | 8.2
E (120 ep) | C + AlphaD_10    | 6.9           | 9.4           | 8.1
I (120 ep) | C + AlphaD_100   | 7.0           | 9.5           | 8.1
J (120 ep) | C + KLD_2        | 7.2           | 9.4           | 8.1
D (120 ep) | C + KLD_10       | 7.1           | 9.4           | 8.0
L (120 ep) | C + KLD_100      | 7.2           | 9.6           | 8.2

Table 2: Supernets trained with probability sampling during in-place knowledge distillation (KD), using Alpha-divergence (AlphaD) or KL divergence (KLD). Subscripts denote the sampling range j ∈ [2, 10, 100]. Results are evaluated at 120 epochs for the max (99.9 MB), min (29.1 MB), and a fixed architecture (47.0 MB) on the LibriSpeech test-other dataset.

5.4 On the Ablation of Training Strategies

In the Supernet results in Table 1, the last two columns (the final WER-size pair) report the best model on the Pareto front when searching for an RNN-T subnetwork of roughly 50 million parameters. Model B1 is trained with channel-and-layer reduction; Model C adds adaptive dropout during training; Model D uses both adaptive dropout and KL-divergence with the sampled top-10 log-probabilities; Model E is the same as Model D but uses Alpha-divergence instead; Model F uses adaptive dropout, Alpha-divergence (sampled over the top 10 log-probabilities), and ScaledAdam optimizer fine-tuning after 120 epochs. All models are trained from scratch and evaluated at 180 epochs.

Model A3 in Table 1, trained with RNN-T losses on intermediate layers, is a special case of Supernet B1: weight sharing among three layer-dropped subnetworks. Model A3, however, performs significantly worse at 40 MB than Model B1. Auxiliary regularization and layer-dropping alone do not explain the Supernet’s superior performance compared to A2 and A3.

Comparing Models B1 and C, adaptive dropout alone improves Supernet training by only a small margin. Despite their early convergence at 120 epochs, models D and E subsequently converge much more slowly than model C. We observe that while in-place KD helps the Supernet converge at earlier stages of training, it also causes the validation loss to diverge at around 100 to 120 epochs. We thus switch to the ScaledAdam optimizer at 120 epochs and reduce the KD loss weight in Equation (4) to $\lambda = 0.1$. Even though the ScaledAdam optimizer is helpful for the last 60 epochs of training, we observe that training TODM with it from the very beginning prevents the Supernet from converging fully.

5.5 Other Observations

We find that Supernet models do not benefit from longer training time. In fact, continuing to train Supernet Model C for another 60 epochs results in a relative WER increase of 16.4% and 8.0% for the max and min models, respectively. This suggests that TODM converges in a similar number of epochs as training one of the models alone. We also find that increasing the learning rate or seeding the Supernet with a pre-trained model hurts Pareto subnetwork accuracies. The latter indicates that the best weight-sharing Supernet may differ significantly from the max network trained alone.

Finally, the Pareto front model architecture of Model F at 49.6MB in Fig 1(b) does not have obvious architectural patterns. This may explain why it is difficult to arrive at this architecture through manual tuning.

Throughout our experiments, the WER trends on the test-clean dataset are similar to those on test-other. For example, models in A1 have test-clean WERs of 2.6 (99.9 MB) and 3.5 (31.6 MB); Supernet Model F has WERs of 2.5 (99.9 MB) and 3.4 (29.1 MB).

6 Conclusion

This paper presents the TODM framework, which can train and discover optimized, dense RNN-Ts of various sizes from a single Supernet, with training resources comparable to a single-model training job. We introduce three strategies into TODM to improve Supernet outcomes: adaptive dropout, in-place sampled knowledge distillation, and ScaledAdam optimizer fine-tuning. TODM discovers many models along the Pareto front of accuracy vs. size, which would have been resource-intensive to find and train manually via trial-and-error.

References

  • [1] Alex Graves and Navdeep Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in International conference on machine learning PMLR, 2014.
  • [2] Tara N Sainath, Yanzhang He, Bo Li, Arun Narayanan, Ruoming Pang, Antoine Bruguier, Shuo-yiin Chang, Wei Li, Raziel Alvarez, Zhifeng Chen, et al., “A streaming on-device end-to-end model surpassing server-side conventional model quality and latency,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.
  • [3] Yuan Shangguan, Rohit Prabhavalkar, Hang Su, Jay Mahadeokar, Yangyang Shi, Jiatong Zhou, Chunyang Wu, Duc Le, Ozlem Kalinli, Christian Fuegen, et al., “Dissecting user-perceived latency of on-device e2e speech recognition,” Proc. of Interspeech, 2021.
  • [4] Alex Graves, “Sequence transduction with recurrent neural networks,” International Conference of Machine Learning (ICML) Workshop on Representation Learning, 2012.
  • [5] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton, “Speech recognition with deep recurrent neural networks,” in International conference on acoustics, speech and signal processing (ICASSP), 2013.
  • [6] Bo Li, Anmol Gulati, Jiahui Yu, Tara N Sainath, Chung-Cheng Chiu, Arun Narayanan, Shuo-Yiin Chang, Ruoming Pang, Yanzhang He, James Qin, et al., “A better and faster end-to-end model for streaming asr,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.
  • [7] Rohit Prabhavalkar, Takaaki Hori, Tara N. Sainath, Ralf Schlüter, and Shinji Watanabe, “End-to-end speech recognition: A survey,” 2023.
  • [8] Shaojin Ding, Weiran Wang, Ding Zhao, Tara N Sainath, Yanzhang He, Robert David, Rami Botros, Xin Wang, Rina Panigrahy, Qiao Liang, et al., “A unified cascaded encoder asr model for dynamic model sizes,” Proc. of Interspeech, 2022.
  • [9] Yuan Shangguan, Jian Li, Qiao Liang, Raziel Alvarez, and Ian McGraw, “Optimizing speech recognition for the edge,” in Machine Learning and Systems (MLSys), On-device Intelligence Workshop, 2019.
  • [10] Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han, “Once-for-all: Train one network and specialize it for efficient deployment,” The International Conference on Learning Representations (ICLR), 2020.
  • [11] Jiahui Yu and Thomas S Huang, “Universally slimmable networks and improved training techniques,” in Proc. of IEEE/CVF international conference on computer vision, 2019.
  • [12] Jiahui Yu, Pengchong Jin, Hanxiao Liu, Gabriel Bender, Pieter-Jan Kindermans, Mingxing Tan, Thomas Huang, Xiaodan Song, Ruoming Pang, and Quoc Le, “Bignas: Scaling up neural architecture search with big single-stage models,” in Computer Vision ECCV 2020, 2020.
  • [13] Jiahui Yu, Linjie Yang, Ning Xu, Jianchao Yang, and Thomas Huang, “Slimmable neural networks,” in International Conference on Learning Representations (ICLR), 2018.
  • [14] Dilin Wang, Chengyue Gong, Meng Li, Qiang Liu, and Vikas Chandra, “Alphanet: Improved training of supernets with alpha-divergence,” in International Conference on Machine Learning (ICML), 2021.
  • [15] Arun Narayanan, Tara N Sainath, Ruoming Pang, Jiahui Yu, Chung-Cheng Chiu, Rohit Prabhavalkar, Ehsan Variani, and Trevor Strohman, “Cascaded encoders for unifying streaming and non-streaming asr,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.
  • [16] Haichuan Yang, Yuan Shangguan, Dilin Wang, Meng Li, Pierce Chuang, Xiaohui Zhang, Ganesh Venkatesh, Ozlem Kalinli, and Vikas Chandra, “Omni-sparsity dnn: Fast sparsity optimization for on-device streaming e2e asr via supernet,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022.
  • [17] Chunxi Liu, Yuan Shangguan, Haichuan Yang, Yangyang Shi, Raghuraman Krishnamoorthi, and Ozlem Kalinli, “Learning a dual-mode speech recognition model via self-pruning,” in Spoken Language Technology Workshop (SLT), 2023.
  • [18] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, “Librispeech: An asr corpus based on public domain audio books,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.
  • [19] Mu Yang, Andros Tjandra, Chunxi Liu, David Zhang, Duc Le, and Ozlem Kalinli, “Learning asr pathways: A sparse multilingual asr model,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.
  • [20] Ruoming Pang, Tara Sainath, Rohit Prabhavalkar, Suyog Gupta, Yonghui Wu, Shuyuan Zhang, and Chung-Cheng Chiu, “Compression of end-to-end models,” Proc. of Interspeech, 2018.
  • [21] Ladislav Mošner, Minhua Wu, Anirudh Raju, Sree Hari Krishnan Parthasarathi, Kenichi Kumatani, Shiva Sundaram, Roland Maas, and Björn Hoffmeister, “Improving noise robustness of automatic speech recognition via parallel data and teacher-student learning,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.
  • [22] Sankaran Panchapagesan, Daniel S Park, Chung-Cheng Chiu, Yuan Shangguan, Qiao Liang, and Alexander Gruenstein, “Efficient knowledge distillation for rnn-transducer models,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.
  • [23] Zengwei Yao, Liyong Guo, Xiaoyu Yang, Wei Kang, Fangjun Kuang, Yifan Yang, Zengrui Jin, Long Lin, and Daniel Povey, “Zipformer: A faster and better encoder for automatic speech recognition,” arXiv preprint arXiv:2310.11230, 2023.
  • [24] Yassir Fathullah, Chunyang Wu, Yuan Shangguan, Junteng Jia, Wenhan Xiong, Jay Mahadeokar, Chunxi Liu, Yangyang Shi, Ozlem Kalinli, Mike Seltzer, et al., “Multi-head state space model for speech recognition,” Proc. Interspeech, 2023.
  • [25] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, “Librispeech: An asr corpus based on public domain audio books,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.
  • [26] Taku Kudo and John Richardson, “Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing,” in EMNLP, 2018.
  • [27] Chunxi Liu, Frank Zhang, Duc Le, Suyoun Kim, Yatharth Saraf, and Geoffrey Zweig, “Improving rnn transducer based asr with auxiliary tasks,” in Spoken Language Technology Workshop (SLT), 2021.