TutorNet: Towards Flexible Knowledge Distillation for End-to-End Speech Recognition
Abstract
In recent years, there has been a great deal of research on developing end-to-end speech recognition models, which simplify the traditional pipeline and achieve promising results. Despite their remarkable performance improvements, end-to-end models typically require a high computational cost to perform well. To reduce this computational burden, knowledge distillation (KD), a popular model compression method, has been used to transfer knowledge from a deep and complex model (teacher) to a shallower and simpler model (student). Previous KD approaches have commonly designed the architecture of the student model by reducing the width per layer or the number of layers of the teacher model. This structural reduction scheme limits the flexibility of model selection, since the student model structure must remain similar to that of the given teacher. To cope with this limitation, we propose a KD method for end-to-end speech recognition, namely TutorNet, that applies KD techniques across different types of neural networks at the hidden representation-level as well as the output-level. Concretely, we first apply representation-level knowledge distillation (RKD) during the initialization step, and then apply softmax-level knowledge distillation (SKD) combined with the original task learning. When the student is trained with RKD, we make use of frame weighting that points out the frames to which the teacher model pays more attention. Through a number of experiments on the LibriSpeech dataset, we verify that the proposed method not only distills knowledge between networks with different topologies but also significantly contributes to improving the word error rate (WER) performance of the distilled student. Interestingly, TutorNet allows the student model to surpass its teacher's performance in some particular cases.
Index Terms:
Speech recognition, connectionist temporal classification, knowledge distillation, teacher-student learning, transfer learning
I Introduction
Recently, there has been a huge interest in research on end-to-end speech recognition, such as connectionist temporal classification (CTC) [1], attention encoder-decoder (AED) [2], and the recurrent neural network transducer (RNN-T) [3]. End-to-end models directly map an input speech signal into the corresponding sequence of words, yielding better performance compared to conventional deep neural network (DNN)-hidden Markov model (HMM) hybrid systems. However, most end-to-end models require heavy computation and a large number of parameters for successful performance. To achieve competitive performance under resource constraints, it is desirable to design a more lightweight model.
[Fig. 1]
Knowledge distillation (KD) is one of the most popular approaches for model compression, which aims at transferring knowledge from a bigger network (teacher) to a much smaller network (student). It is generally assumed that the teacher has been trained separately at considerable computational cost. The goal of KD is to make the student mimic the behavior of the teacher, leading to better performance than when the student is trained alone. In designing the student model's architecture, conventional KD approaches typically reduce the layer width or the number of layers of the teacher model, which reduces the number of parameters dramatically. However, this simple structure-reduction scheme usually limits the flexibility of model selection. For speech recognition, there are various types of end-to-end models, such as DeepSpeech2 [4], Wav2Letter [5], Jasper [6], QuartzNet [7], AED [2], and Transformer [8]. Despite the diversity of models available, the conventional KD technique usually requires a student model structure similar to that of the given teacher model. In other words, no matter how suitable some models are as a teacher (high performance) or a student (fewer parameters, faster inference, etc.), their adoption has often been restricted because their structure differs from that of the counterpart. For example, Transformer has been proven to perform well in the speech area, but it has not been used as a teacher for other types of models, such as CTC or AED models.
To handle this limitation, in this paper, we attempt to apply KD techniques in an unexplored setting where the architecture of the student is inherently different from that of the teacher, as conceptually displayed in Fig. 1. For instance, via the proposed approach, a recurrent neural network (RNN)-based DeepSpeech2 can benefit from the distilled knowledge of a convolutional neural network (CNN)-based Jasper not only at the hidden representation-level but also at the output-level. The opposite case, i.e., transferring knowledge from the RNN-based DeepSpeech2 to a CNN-based Jasper-style model, is also possible. Furthermore, the student can be trained with the knowledge of both the CNN-based Jasper and the RNN-based DeepSpeech2. In addition to the CTC model, the proposed method can be applied to other types of models. We propose a KD framework called TutorNet, which connects the teacher and student models even when the two models have different types of structures. TutorNet consists of two stages: (1) representation-level KD (RKD) for initializing the network parameters and (2) softmax-level KD (SKD) for transferring the softmax prediction, where both stages can be applied regardless of the difference in model architecture. When training the student model with RKD, we utilize frame weighting that picks the frames to which the teacher model pays attention. We verify the effectiveness of the proposed method via a substantial model comparison.
To summarize, the main contributions of this paper are:
• We introduce TutorNet for transferring the hidden representations and softmax values across different types of neural networks. On top of that, we also make use of frame weighting, which reflects which frames are important for KD.
• To distill frame-level posteriors in the CTC framework, we suggest that the $\ell_2$ loss is more suitable than the conventional Kullback-Leibler (KL)-divergence.
• The proposed method substantially outperformed the other conventional KD methods in several speech recognition experiments. It is noted that the student model performs even better than its teacher in some particular cases.
• TutorNet is applicable not only to CTC-based models but also to other end-to-end speech recognition models.
The rest of the paper is organized as follows: Related work is described in Section II. We introduce the proposed KD method, namely TutorNet, in Section III, describe the experimental settings in Section IV, and then present experimental results obtained under various settings in Section V. Finally, Section VI concludes the paper.
II Related work
II-A Connectionist Temporal Classification (CTC)
Generally, an end-to-end speech recognition model directly maps a sequence of input acoustic features $\mathbf{x} = (x_1, \dots, x_T)$ into a sequence of target labels $\mathbf{y} = (y_1, \dots, y_U)$, where $y_u \in \mathcal{V}$ with $\mathcal{V}$ being the set of labels in texts. $T$ and $U$ are respectively the total number of frames and the length of the target label sequence. To deal with the sequence-to-sequence mapping problem when the two sequences have unequal lengths, the CTC framework [1] introduces "blank" as an additional label and allows the repetition of all labels across frames. An alignment sequence $\boldsymbol{\pi} = (\pi_1, \dots, \pi_T)$ is a sequence of initial output labels for each frame, as every input frame $x_t$ is mapped to a certain label $\pi_t$, where $\pi_t \in \mathcal{V} \cup \{\text{blank}\}$. A mapping function $\mathcal{B}$, which is defined as $\mathcal{B}(\boldsymbol{\pi}) = \mathbf{y}$, converts the alignment sequence $\boldsymbol{\pi}$ into the final output sequence $\mathbf{y}$ after merging consecutive repeated characters and removing blank labels. For example, the two alignment sequences $(a, -, a, b, -)$ and $(a, a, -, a, b)$ (using '$-$' to denote the blank label) correspond to the same sequence $(a, a, b)$ through the mapping function $\mathcal{B}$. The alignment between the input and the target output is not explicitly required in CTC training. The conditional probability of the target label sequence $\mathbf{y}$ given the input sequence $\mathbf{x}$ is defined as

$P(\mathbf{y} \mid \mathbf{x}) = \sum_{\boldsymbol{\pi} \in \mathcal{B}^{-1}(\mathbf{y})} P(\boldsymbol{\pi} \mid \mathbf{x}),$   (1)

where $\mathcal{B}^{-1}$ denotes the inverse mapping, which returns all possible alignment sequences compatible with $\mathbf{y}$. The conditional probability of a path $\boldsymbol{\pi}$ can be calculated as follows:

$P(\boldsymbol{\pi} \mid \mathbf{x}) = \prod_{t=1}^{T} P(\pi_t \mid \mathbf{x}).$   (2)

Given the target $\mathbf{y}$ and the input $\mathbf{x}$, the CTC loss function is defined as

$\mathcal{L}_{\mathrm{CTC}} = -\sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}} \log P(\mathbf{y} \mid \mathbf{x}),$   (3)

where $\mathcal{D}$ represents a training dataset.
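To make the mapping function $\mathcal{B}$ concrete, the following minimal Python sketch (our illustration, not code from the paper) merges consecutive repeated labels and then removes blanks; the string labels and the '-' blank symbol are arbitrary choices for the example.

```python
# Minimal sketch of the CTC collapse mapping B: merge repeats, then drop blanks.
from typing import List, Sequence

BLANK = "-"  # symbol used here to denote the CTC blank label

def ctc_collapse(alignment: Sequence[str], blank: str = BLANK) -> List[str]:
    """Apply the mapping B: merge consecutive repeated labels, then remove blanks."""
    output: List[str] = []
    previous = None
    for label in alignment:
        if label != previous:      # merge consecutive repeated labels
            if label != blank:     # drop blank labels
                output.append(label)
        previous = label
    return output

# Both alignments below collapse to the same target sequence (a, a, b).
assert ctc_collapse(["a", "-", "a", "b", "-"]) == ["a", "a", "b"]
assert ctc_collapse(["a", "a", "-", "a", "b"]) == ["a", "a", "b"]
```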
![]() |
II-B Attention-based Encoder-Decoder (AED)
An alternative approach to the end-to-end mapping between speech and label sequences is to use an attention-based model. Unlike the CTC framework, AED directly predicts each target without requiring any alignment sequence. The framework consists of two sub-modules, an Encoder and a Decoder, so that it can learn the mapping between two sequences of different lengths based on the cross-entropy criterion. The model predicts the posterior probability of the output transcription given the input speech features as follows:

$\mathbf{h} = \mathrm{Encoder}(\mathbf{x}),$   (4)

$P(y_u \mid \mathbf{x}, y_{1:u-1}) = \mathrm{Decoder}(\mathbf{h}, y_{1:u-1}),$   (5)

where $\mathbf{x}$ and $\mathbf{h}$ are sequences of input speech features and encoded vectors, respectively, and $\mathbf{y} = (y_1, \dots, y_U)$ is a sequence of output text units whose length is $U$. The Encoder extracts the encoded vectors $\mathbf{h}$ from the input speech features $\mathbf{x}$ in (4). Then the Decoder network predicts the next output symbol conditioned on the full sequence of previous predictions and the acoustics, which can be written as $P(y_u \mid \mathbf{x}, y_{1:u-1})$ in (5). The attention mechanism determines which encoded vectors should be attended to in order to predict the next output symbol.
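As a rough illustration of how the attention mechanism weights the encoded vectors, the sketch below implements a single decoder step with plain dot-product attention; the models in this paper use location-based attention, so this is only a simplified stand-in with our own naming.

```python
# Illustrative sketch only: one AED decoder step with simple dot-product attention.
import torch
import torch.nn.functional as F

def attention_step(decoder_state: torch.Tensor,    # (batch, dim)
                   encoder_outputs: torch.Tensor   # (batch, time, dim)
                   ) -> torch.Tensor:
    # Attention scores over encoder frames, then a convex combination (context vector).
    scores = torch.bmm(encoder_outputs, decoder_state.unsqueeze(-1)).squeeze(-1)  # (batch, time)
    weights = F.softmax(scores, dim=-1)                                           # attention weights
    context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)         # (batch, dim)
    return context  # fed, together with the previous label, into the decoder network
```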
II-C Knowledge Distillation
Neural network models usually perform well with a large number of parameters. However, as a model architecture gets deeper, it requires heavy computation for both training and testing. To mitigate this computational burden, there has been a long line of research on KD, which aims at distilling knowledge from a big teacher model to a small student model. With this additional transfer procedure, the student can perform better compared to naive training. Existing KD methods typically fall into two categories: (1) transferring class probability and (2) transferring the representation of the hidden layer.
Generally, the output layer for the classification tasks uses softmax as an activation function. The output of the teacher model is a probability distribution over the target classes, and the sum of the outputs equals 1. Hinton et al. [9] first introduced KD, which distills class probability by minimizing the KL-divergence between the softmax outputs of the teacher and student. Compared to the one-hot label, the teacher’s softmax prediction has a nonzero probability value for each target class. This soft label is normally considered more informative than the one-hot encoded ground truth, further improving the student model in KD. The KD technique mentioned above only considers the output of the teacher model. In the case of transferring the hidden representation, some KD methods [10, 11, 12, 13, 14], especially in image processing, proposed transferring the representation-level information of the hidden layers where the mean squared error (MSE) between the representation-level knowledge of both models is minimized.
For speech, Li et al. [15] first attempted to apply teacher-student learning. In previous studies on speech recognition, KD has typically been applied to DNN-HMM hybrid systems by minimizing the frame-level cross-entropy loss between the output distributions of the teacher and student [16, 17, 18, 19]. Geras et al. [20] proposed a KD method that distills softmax-level knowledge from an RNN-based model to a CNN-based model in the DNN-HMM framework. Wong et al. [21, 22] investigated sequence-level knowledge distillation for DNN-HMM systems trained with sequence-discriminative criteria. In the case of KD under the cross-entropy criterion, the KD loss can be calculated as

$\mathcal{L}_{\mathrm{KD}} = -\sum_{t=1}^{T} \sum_{k} P_T(k \mid x_t) \log P_S(k \mid x_t),$   (6)

where $P_T(k \mid x_t)$ represents the posterior probability of the target label $k$ at frame $x_t$ yielded by the teacher model, and $P_S(k \mid x_t)$ is that of the student model.

The same frame-level KD method has also been applied to KD of CTC models [23]. The frame-level KD in the CTC framework can be computed as follows:

$\mathcal{L}_{\mathrm{frame\text{-}KD}} = -\sum_{t=1}^{T} \sum_{k \in \mathcal{V} \cup \{\text{blank}\}} P_T(k \mid x_t) \log P_S(k \mid x_t),$   (7)

where $P_T$ and $P_S$ denote the frame-wise posterior probabilities of the teacher and student CTC models, respectively. However, as reported in previous studies [23, 24, 25], applying the frame-level KD approach to a CTC-based speech recognition system can worsen the word error rate (WER) performance compared with a CTC model trained only on the ground truth.
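For reference, a minimal sketch of this frame-level KD objective (our paraphrase of (6)/(7), not the authors' code) is shown below; `teacher_logits` and `student_logits` stand for frame-wise output logits, including the blank label in the CTC case. This is the objective whose direct use can degrade the CTC student, as noted above.

```python
# Sketch of the conventional frame-level KD objective: push the student's
# frame-wise posteriors toward the teacher's with a cross-entropy term per frame.
import torch
import torch.nn.functional as F

def frame_level_kd_loss(teacher_logits: torch.Tensor,  # (time, num_labels)
                        student_logits: torch.Tensor   # (time, num_labels)
                        ) -> torch.Tensor:
    teacher_probs = F.softmax(teacher_logits, dim=-1)          # P_T(k | x_t)
    student_log_probs = F.log_softmax(student_logits, dim=-1)  # log P_S(k | x_t)
    # Cross-entropy between teacher and student posteriors, summed over labels,
    # averaged over frames (equals frame-wise KL up to the teacher's entropy).
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()
```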
To address this problem, Takashima et al. [24] proposed a KD method that distills a sequence-level knowledge in the CTC framework using the teacher model’s N-best hypotheses. Kurata and Audhkhasi [26, 27] also introduced a KD approach for long short-term memory (LSTM)-CTC-based speech recognition model where a student can be trained using the frame-wise alignment of the teacher.
III Proposed Method
Our main goal is to transfer the teacher model's knowledge to the student model without being restricted by the type of model architecture. As shown in Fig. 2, TutorNet is mainly composed of two stages: (1) representation-level KD (RKD) with frame weighting and (2) softmax-level KD (SKD). In Section III-A, we first describe the initialization step, RKD, where the student benefits from the hidden representations of the teacher. Even when the two models have different types of architecture, the proposed method enables flexible knowledge transfer at the representation-level. While training the student with RKD, we make use of frame weighting that picks the frames to which the teacher model pays attention. Section III-B subsequently introduces SKD, which allows the student to frame-wisely track the posterior distribution of the teacher. Previous studies [23, 24, 25] have found that applying the conventional frame-level KD to a student CTC model is a challenging problem. Based on these observations, we adopt the $\ell_2$ loss instead of the conventional KL-divergence objective.
III-A Representation-Level Knowledge Distillation with Frame Weighting
As mentioned above, the most frequently employed KD approach for speech recognition is to train a student with the teacher’s softmax prediction as a target, besides the one-hot encoded ground truth. However, the hidden representations of the teacher model are also considered important to provide essential knowledge for training the other. Moreover, if we can transfer hidden representations regardless of the types of neural network architecture, more flexible and effective distillation will be possible.
III-A1 Hidden representation matching using 1D convolutional layer
Let $H_T^i$ and $H_S^j$ respectively denote the hidden representations from the $i$-th layer of the teacher and the $j$-th layer of the student, where the two models are assumed to have different architectures. In the CTC framework, when the speech signal $\mathbf{x}$ is given as an input to both models, the $i$-th hidden layer representation of the teacher model can be expressed as $H_T^i \in \mathbb{R}^{T \times d_T}$, where $T$ represents the total number of frames and $d_T$ the hidden layer dimension of the teacher. In a similar way, $H_S^j \in \mathbb{R}^{T \times d_S}$, where $d_S$ denotes the hidden layer width of the student. Since the hidden layer dimensions $d_T$ and $d_S$ are usually different, we apply a convolutional layer to minimize the following mismatch error:

$\mathcal{L}_{\mathrm{mismatch}} = \big\| H_T^i - f\big(H_S^j; \theta_f\big) \big\|_2^2,$   (8)

where $f(\cdot\,; \theta_f)$ is a 1D convolutional layer and $\theta_f$ denotes its parameters. The essential difference from FitNets [11] is that we attempt to transfer the hidden representation across different types of neural networks for end-to-end speech recognition, which is an unexplored area in the related research. Since $H_S^j$ has a different structural nature from $H_T^i$, the convolutional layer not only converts the hidden layer size of the student from $d_S$ to $d_T$ but also effectively distills the teacher's hidden layer information even when both models have different architectures.
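A minimal sketch of this dimension-matching step is given below, using our own variable names and the LibriSpeech dimensions reported later ($d_T = 1024$, $d_S = 512$); the kernel size of 1 follows the implementation details in Section IV-D.

```python
# Sketch of the dimension-matching step in (8): a 1D convolutional layer f(.; theta_f)
# with kernel size 1 maps the student's hidden width d_S to the teacher's width d_T
# before the l2 mismatch is computed.
import torch
import torch.nn as nn
import torch.nn.functional as F

T_frames, d_T, d_S = 200, 1024, 512
adapter = nn.Conv1d(in_channels=d_S, out_channels=d_T, kernel_size=1)

H_teacher = torch.randn(1, T_frames, d_T)   # H_T^i: (batch, frames, d_T)
H_student = torch.randn(1, T_frames, d_S)   # H_S^j: (batch, frames, d_S)

# Conv1d expects (batch, channels, frames), so transpose, project, transpose back.
H_student_proj = adapter(H_student.transpose(1, 2)).transpose(1, 2)  # (1, T_frames, d_T)
mismatch = F.mse_loss(H_student_proj, H_teacher)  # l2 mismatch error of (8)
```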
[Fig. 3]
III-A2 Frame weighting
It is generally accepted that each frame of the speech has a different importance for KD. For instance, active speech periods should be treated as more important than silence periods. For this reason, instead of transferring all the hidden representations equally, we employ frame weighting that puts more emphasis on frames corresponding to neurons with high activations. A frame weighting function $\Phi(\cdot)$ takes the teacher's representation $H_T^i$ as an input and outputs a frame weighting mask $M = \Phi(H_T^i) \in \mathbb{R}^{T \times d_T}$. As illustrated in Fig. 3, the procedure for computing $M$ is as follows:

a) Average $H_T^i$ over the hidden-representation-dimension (horizontal) axis, where the resulting vector is in $\mathbb{R}^{T}$.
b) For normalization, map the averaged values to the interval $(0, 1)$ by using a sigmoid function.
c) Replicate the values of the vector $d_T$ times along the hidden layer dimension to produce $M \in \mathbb{R}^{T \times d_T}$.

RKD is used to initialize the student's parameters before CTC training. It aims to transfer the hidden representations, which serve as a good initialization of the model parameters. In RKD, considering the frame weighting mask $M$, the student model is trained to minimize the following loss function:

$\mathcal{L}_{\mathrm{RKD}} = \sum_{(i, j) \in \mathcal{C}} \big\| M \odot \big( H_T^i - f\big(H_S^j; \theta_f\big) \big) \big\|_2^2,$   (9)

where $\mathcal{C}$ represents the set of candidate layer index pairs and $\odot$ indicates the Hadamard product. In (9), it is assumed that the $i$-th layer of the teacher model is transferred to the $j$-th layer of the student when $(i, j) \in \mathcal{C}$.
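The following sketch (ours, not the released implementation) computes the frame weighting mask $M$ as in steps a)-c) and the weighted RKD loss of (9) for a single layer pair, assuming the projected student representation from the previous sketch; the reduction (sum versus mean) is an assumption.

```python
# Sketch of the frame weighting mask and the RKD loss in (9) for one layer pair (i, j).
import torch

def frame_weighting_mask(H_teacher: torch.Tensor) -> torch.Tensor:
    # (a) average over the hidden-dimension axis, (b) squash with a sigmoid,
    # (c) replicate the per-frame weight along the hidden dimension.
    per_frame = torch.sigmoid(H_teacher.mean(dim=-1, keepdim=True))  # (batch, frames, 1)
    return per_frame.expand_as(H_teacher)                            # (batch, frames, d_T)

def rkd_loss(H_teacher: torch.Tensor, H_student_proj: torch.Tensor) -> torch.Tensor:
    mask = frame_weighting_mask(H_teacher)
    # Hadamard product with the mask, then squared l2 error.
    return ((mask * (H_teacher - H_student_proj)) ** 2).sum()
```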
III-B Softmax-Level Knowledge Distillation
[Fig. 4]
Before we describe SKD for distilling the frame-level posterior in the CTC framework, it might be beneficial to review some characteristics of the CTC model. The output of the CTC model has two notable characteristics to consider when transferring a posterior distribution to another model. First, as noted in prior studies [1, 28], the softmax prediction obtained from a CTC-trained model is very spiky. This indicates that the softmax output tends to be similar to a one-hot vector. Second, since CTC is an alignment-free framework, CTC models trained on the same training data can have different frame-level alignments. In other words, the frame-level alignment yielded by the student can differ from that of the teacher. Unfortunately, this characteristic makes KD difficult because the KL-divergence and cross-entropy can hardly converge. Suppose we have two spiky probability distributions $P_T(\cdot \mid x_t)$ and $P_S(\cdot \mid x_t)$ obtained at a certain frame $t$. As shown in Fig. 4, $P_T$ places almost all of its probability on the label 'a' while $P_S$ places it on the 'blank' label. In this case, the KL-divergence becomes extremely large (infinite in the limit of one-hot outputs) and KD hardly converges.
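As a small worked example (ours, restricted to two labels for brevity) of why the KL term can explode for such mismatched spiky outputs while an $\ell_2$-type distance stays bounded, consider frame posteriors $P_T = (0.99, 0.01)$ and $P_S = (0.01, 0.99)$ over $\{a, \text{blank}\}$:

$D_{\mathrm{KL}}(P_T \,\|\, P_S) = 0.99 \ln\frac{0.99}{0.01} + 0.01 \ln\frac{0.01}{0.99} \approx 4.5, \qquad \|P_T - P_S\|_2^2 = 2 \times (0.98)^2 \approx 1.92 \le 2.$

If $P_S$ assigned probability exactly 0 to the label 'a', the KL-divergence would become infinite, whereas the squared $\ell_2$ distance can never exceed 2.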
In previous studies [23, 24, 25], it has been confirmed that applying the conventional frame-level KD approach can worsen the performance of a student CTC model compared with a model trained without KD. Also, we tried to train a student CTC model with the interpolation between the original CTC loss in (3) and frame-level KD in (7), but it failed to converge as reported in [24].
To deal with this instability problem, we propose to use the $\ell_2$ loss instead of the KL-divergence. Since the $\ell_2$ distance between two distributions is always bounded, unlike the KL-divergence, it improves the numerical stability of the distillation loss. This alternative approach, namely SKD, allows the student to stably learn the alignment of all the output labels, including blank ones. The SKD loss used to train the student model is given as

$\mathcal{L}_{\mathrm{SKD}} = \sum_{t=1}^{T} \Big\| \mathrm{softmax}\big(\mathbf{z}_t^{T} / \tau\big) - \mathrm{softmax}\big(\mathbf{z}_t^{S} / \tau\big) \Big\|_2^2,$   (10)

where $\mathbf{z}_t^{T}$ and $\mathbf{z}_t^{S}$ are the logits obtained from the teacher and student models at frame $t$, respectively, and $\tau$ represents a temperature parameter. During SKD training, the SKD loss and the standard CTC loss are combined into an integrated loss. Thus, the final objective function for SKD can be formulated as

$\mathcal{L}_{\mathrm{final}} = \mathcal{L}_{\mathrm{CTC}} + \lambda\, \mathcal{L}_{\mathrm{SKD}},$   (11)

where $\lambda$ is a tunable parameter.
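A sketch of how (10) and (11) could be implemented is given below; the exact placement of the temperature and the weighting of the two losses follow our reading of the text above, so these details are assumptions rather than the authors' reference code.

```python
# Sketch of the SKD objective and the combined loss: an l2 distance between
# temperature-softened frame-wise softmax outputs of teacher and student,
# added to the standard CTC loss through a tunable weight lambda.
import torch
import torch.nn.functional as F

def skd_loss(teacher_logits: torch.Tensor,  # (time, num_labels) CTC logits z^T
             student_logits: torch.Tensor,  # (time, num_labels) CTC logits z^S
             tau: float = 1.0) -> torch.Tensor:
    p_teacher = F.softmax(teacher_logits / tau, dim=-1)
    p_student = F.softmax(student_logits / tau, dim=-1)
    # Squared l2 distance per frame, summed over frames; bounded, unlike KL.
    return ((p_teacher - p_student) ** 2).sum()

def skd_total_loss(ctc_loss: torch.Tensor, teacher_logits, student_logits,
                   tau: float = 1.0, lam: float = 0.25) -> torch.Tensor:
    # Final objective: CTC loss plus lambda-weighted SKD loss.
    return ctc_loss + lam * skd_loss(teacher_logits, student_logits, tau)
```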
III-C Learning Procedure
Our approach includes two stages of training: RKD as the initialization step and SKD in conjunction with the CTC objective as the fine-tuning step. First, by minimizing $\mathcal{L}_{\mathrm{RKD}}$ alone, without the CTC loss function, the student learns to mimic the teacher's behavior at the representation-level. To effectively transfer the representation-level knowledge of the teacher, we apply the frame weighting mask $M$ as in (9). After the RKD initialization, the student model is trained with SKD. As the student is trained to minimize the combined loss $\mathcal{L}_{\mathrm{final}}$ in (11), it can frame-wisely track the softmax prediction of the teacher.
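The overall two-stage procedure can be summarized by the following high-level sketch (our paraphrase, not the authors' training script); `teacher.hidden`, `batch.features`, `batch.targets`, `adapter`, `rkd_loss`, `skd_loss`, and `ctc_loss_fn` are hypothetical components standing in for the sketches above and a CTC loss implementation.

```python
# High-level sketch of the two-stage TutorNet procedure (illustrative only).
import torch

def train_tutornet(teacher, student, adapter, loader, ctc_loss_fn,
                   rkd_epochs, skd_epochs, lam=0.25):
    teacher.eval()  # the teacher is fixed; only the student (and adapter) are updated

    # Stage 1: RKD initialization -- match hidden representations only.
    opt = torch.optim.Adam(list(student.parameters()) + list(adapter.parameters()))
    for _ in range(rkd_epochs):
        for batch in loader:
            with torch.no_grad():
                h_t = teacher.hidden(batch.features)          # teacher hidden representation
            h_s = adapter(student.hidden(batch.features))     # projected student representation
            loss = rkd_loss(h_t, h_s)                         # Eq. (9), with frame weighting
            opt.zero_grad(); loss.backward(); opt.step()

    # Stage 2: CTC training combined with SKD -- track the teacher's softmax frame-wise.
    opt = torch.optim.Adam(student.parameters())
    for _ in range(skd_epochs):
        for batch in loader:
            with torch.no_grad():
                z_t = teacher(batch.features)                 # teacher logits
            z_s = student(batch.features)                     # student logits
            loss = ctc_loss_fn(z_s, batch.targets) + lam * skd_loss(z_t, z_s)  # Eq. (11)
            opt.zero_grad(); loss.backward(); opt.step()
```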
IV Experimental settings
IV-A Dataset
We evaluated the performance of the proposed method on two speech datasets: LibriSpeech [29] and AISHELL-2 [30]. LibriSpeech is a large-scale (about 1000 hours) English speech corpus derived from audiobooks, sampled at 16 kHz. The dataset is divided into clean and other subsets. In the experiments, "train-clean-100", "train-clean-360", and "train-other-500" were used in the training phase. For evaluation, "dev-clean", "dev-other", "test-clean", and "test-other" were used. We also conducted experiments on AISHELL-2, a Mandarin read-speech dataset (around 1000 hours of training data). The models were evaluated on the AISHELL2-2018A-EVAL data, which contain a dev set (2500 utterances from 5 speakers) and a test set (5000 utterances from 10 speakers).
IV-B Performance Metrics
For LibriSpeech, we measured two metrics: word error rate (WER) and relative error rate reduction (RERR). WER is commonly employed to quantify speech recognition performance. To calculate WER, the number of errors is obtained by counting the substitutions, insertions, and deletions that occur in the recognition result; it is then divided by the total number of words in the reference sentence. RERR shows how much the WER is reduced, in proportion, compared to the baseline. For the Mandarin dataset, we measured the character error rate (CER) instead of WER, because a single character often represents a word in the Mandarin writing system. CER follows the same formula as WER, but with characters as the unit.
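Expressed as formulas (our restatement of the definitions above), with $S$, $I$, and $D$ denoting the numbers of substitutions, insertions, and deletions, and $N$ the number of words in the reference transcript:

$\mathrm{WER} = \dfrac{S + I + D}{N} \times 100\,\%, \qquad \mathrm{RERR} = \dfrac{\mathrm{WER}_{\mathrm{baseline}} - \mathrm{WER}_{\mathrm{model}}}{\mathrm{WER}_{\mathrm{baseline}}} \times 100\,\%.$

CER is obtained from the same first formula with characters instead of words as the unit.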
IV-C Model Configuration
For KD, we adopted the following speech recognition models, each used as either a teacher or a student:

• CTC model
  – Jasper Dense Residual (Jasper DR) [6]: Jasper DR is a deep time-delay neural network (TDNN) composed of blocks of 1D convolutional layers. It differs from the original Jasper in that it has dense residual connections: the output of each convolution block is fed as an input to all the following blocks via these connections. In our experiments, we applied a pre-trained Jasper DR, which consists of 54 convolutional layers, as the CNN-based teacher model.
  – DeepSpeech2 [4]: DeepSpeech2 has the architecture of a deep RNN with a combination of convolutional and fully connected layers. As the RNN-based model, we applied DeepSpeech2 consisting of two 2D convolutional layers followed by three 512-dimensional bidirectional LSTM (BLSTM) layers and one fully connected layer.
  – Jasper Mini: Jasper Mini is composed of blocks of depthwise separable 1D convolutional layers. Depthwise separable convolution reduces the number of parameters and the computation required in the convolutional operations. The main structural difference from Jasper DR is that Jasper Mini consists of depthwise separable convolutions with no dense residual connection. As the CNN-based student model, we adopted Jasper Mini with 33 depthwise separable convolutional layers.
  – QuartzNet [7]: QuartzNet is composed of multiple blocks with residual connections between them. Each block consists of one or more modules with 1D time-channel separable convolutional layers, batch normalization, and ReLU. We adopted a QuartzNet that consists of 54 1D time-channel separable convolutional layers.
• Hybrid CTC/attention model [31]
  – Attention-based encoder-decoder (AED): Motivated by prior studies [32, 33], the encoder contained the initial layers of the VGG net architecture [34] followed by four 1024-dimensional BLSTM layers and a linear projection layer, which yields better performance than the pyramidal BLSTM [35] in many cases. We used a location-based attention mechanism, and the decoder network was a 2-layer LSTM with 1024 cells. The model was trained with both the CTC and attention objectives simultaneously.
Table I compares the WERs obtained from the baseline models with greedy decoding, along with their numbers of parameters. Since the DeepSpeech2 baseline with BPE units failed to converge with a single BPE CTC loss, we adopted the CTC model initialized with RKD over the first 100 steps as the student baseline.
Dataset | Model Type | Unit Set | WER / CER (%) | Params. (M)
---|---|---|---|---
LibriSpeech | Jasper DR | Character | 3.61 | 332.63
LibriSpeech | DeepSpeech2 | Character | 7.64 | 13.19
LibriSpeech | Jasper Mini | Character | 8.66 | 8.19
LibriSpeech | QuartzNet | Character | 5.66 | 13.09
LibriSpeech | AED | BPE | 5.80 | 174.10
LibriSpeech | Transformer | BPE | 3.21 | 78.36
LibriSpeech | DeepSpeech2 | BPE | 8.82 | 21.71
AISHELL-2 | Jasper DR | Character | 9.69 | 337.94
AISHELL-2 | Jasper Mini | Character | 11.77 | 13.50
AISHELL-2 | DeepSpeech2 | Character | 13.40 | 15.77
IV-D Implementation Details
Our experiments were conducted using three toolkits: OpenSeq2Seq [36], ESPnet [37], and NeMo [38]. When training the CTC-based model, we mainly used OpenSeq2Seq. In the case of hybrid CTC/attention model, such as AED and Transformer, we applied the ESPnet toolkit. Pre-trained Jasper DR for the Mandarin dataset was provided by the NeMo toolkit.
• LibriSpeech
  – CTC model: We extracted 64-dimensional log-Mel filterbank features as the input, and the character set has a total of 29 labels: the lowercase Latin alphabet (a-z), the apostrophe, the space, and the blank label. In the case of Jasper DR, we used the checkpoint provided by the OpenSeq2Seq toolkit. For the character-level CTC-based student models, all the training except QuartzNet was performed on three Titan V GPUs, each with 12GB of memory. While training the RNN-based DeepSpeech2, the Adam algorithm [39] was employed as an optimizer with an initial learning rate of 0.001 that was reduced with polynomial decay. In the case of the CNN-based Jasper Mini, we used the NovoGrad optimizer [40], which is based on stochastic gradient descent (SGD); its initial learning rate started from 0.02 with the same decay policy as above. For RKD, we selected the last layer of the teacher and student models to transfer the hidden representation (since CTC can be regarded as frame-level classification, we followed the common practice of using the last layer for KD in classification tasks), and the student was trained for 5 epochs. When transferring the knowledge from the CNN-based Jasper DR to the RNN-based DeepSpeech2, the kernel size of the 1D convolutional layer, $d_T$, and $d_S$ were 1, 1024, and 512, respectively. In distilling from the RNN-based DeepSpeech2 to the CNN-based Jasper Mini, they were 1, 512, and 1024, respectively. After the RKD initialization, 50 epochs were spent for CTC training with SKD. In the case of QuartzNet, we trained the student model for 10 epochs of RKD initialization and 100 epochs of SKD training; this training was performed on three Titan RTX GPUs, each with 24GB of memory, and both $d_T$ and $d_S$ were equal to 1024. We experimentally set the tunable parameter $\lambda$ to 0.25, which showed the best performance on the dev-clean set. When applying beam-search decoding with a language model (LM), we used KenLM [41] for a 4-gram LM, where the LM weight, the word insertion weight, and the beam width were experimentally set to 2.0, 1.5, and 256, respectively. When training the CTC model distilled from Transformer, since we used byte-pair encoding (BPE) [42] to construct the subword units (around 5000 tokens) for Transformer, the CTC-based student model was also trained with the same BPE units as its teacher; this training was performed on three Titan RTX GPUs, each with 24GB of memory. In distilling the knowledge from Transformer to the CTC-based model, 5 epochs and 50 epochs were spent for RKD and SKD, respectively.
  – Hybrid CTC/attention model: When training AED, we used one Titan V GPU. Adadelta [43] was employed as an optimizer with an initial learning rate of 1.0. During training and inference, we set the CTC weight to 0.5. For RKD, the student was trained for 1 epoch, and then it was trained with SKD for 10 epochs. In the case of Transformer, we used the pre-trained model provided by the ESPnet toolkit. We used BPE to construct the subword units for the labels of both AED and Transformer (around 5000 tokens). For the external LM, we used the pre-trained RNNLM from the ESPnet toolkit, and the beam size was experimentally set to 20. When transferring the knowledge from Transformer to the CTC-based model, $d_T$ and $d_S$ were both equal to 512. In KD from Transformer to the AED model, $d_T$ and $d_S$ were 512 and 1024, respectively.
• AISHELL-2: For the Mandarin dataset, the character set has a total of 5207 labels. For the teacher model, we used the pre-trained Jasper DR provided by the NeMo toolkit. When training the student models, we used three Titan V GPUs. For the CNN-based student model, the NovoGrad optimizer was adopted; for the RNN-based student model, the Adam algorithm was employed. For RKD and SKD, 2 epochs and 20 epochs were spent, respectively. When training with SKD, we set $\lambda$ to 1.
IV-E Conventional KD Techniques for Performance Comparison
We applied the following conventional KD techniques for performance comparison. (We also tried to train a student CTC model with an interpolation between the original CTC loss in (3) and the frame-level KD in (7), but it failed to converge, as reported in [24]; therefore, frame-level KD was not considered for comparison.)

• Sequence-level knowledge distillation [24]: The N-best hypotheses of the teacher are used as sequence-level knowledge for KD. In this experiment, the 5-best hypotheses were extracted using KenLM, where the LM weight, the word insertion weight, and the beam width were experimentally set to 2.0, 1.5, and 256, respectively. (The sequence-level KD criterion can be summarized as a weighted mean of the original CTC loss over the hypothesized label sequences, where the posterior probabilities of the hypotheses estimated by the teacher CTC model serve as the weights. In our experiments, most of the probability mass was concentrated on the highest-scoring hypothesis at the top of the beam, and even when we extracted more than 5 hypotheses, the mass remained concentrated on very few sentences; this is why we used 5-best hypotheses for the comparison.)
• Guided CTC training [27]: From the posterior distribution of the teacher, guided CTC training builds a mask that is 1 only at the output label with the highest posterior at each frame. Using this guided mask, the student is encouraged to follow the frame-level alignment of the teacher, as sketched below.
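A minimal sketch of the guided mask described above is given below (our illustration; the exact guided loss follows [27], and the penalty term here is only one plausible way to apply the mask).

```python
# Sketch of the guided mask used in guided CTC training: 1 only at the teacher's
# highest-posterior label in each frame, used to steer the student's posterior spikes.
import torch
import torch.nn.functional as F

def guided_mask(teacher_logits: torch.Tensor) -> torch.Tensor:  # (time, num_labels)
    top_labels = teacher_logits.argmax(dim=-1)                  # teacher's best label per frame
    return F.one_hot(top_labels, num_classes=teacher_logits.size(-1)).float()

def guided_penalty(teacher_logits: torch.Tensor, student_logits: torch.Tensor) -> torch.Tensor:
    mask = guided_mask(teacher_logits)
    student_probs = F.softmax(student_logits, dim=-1)
    # Reward student probability mass under the mask (one common way to use it).
    return -(mask * student_probs).sum()
```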
V Experimental results
V-A LibriSpeech
For the convenience of notation and ease of comparison, we let CNNDR, RNNDS, and CNNMini denote Jasper DR, DeepSpeech2, and Jasper Mini, respectively. In the subsequent part of this paper, A → B means that model A transfers its knowledge to model B, where the former is the teacher and the latter is the student. To verify the effectiveness of the proposed method in various situations, we tried six different transfer scenarios on the LibriSpeech dataset:
• CNNDR → RNNDS: from the CNN-based model (Jasper DR) to the RNN-based model (DeepSpeech2)
• CNNDR → CNNMini: from the CNN-based model (Jasper DR) to the CNN-based model (Jasper Mini)
• RNNDS → CNNMini: from the RNN-based model (DeepSpeech2) to the CNN-based model (Jasper Mini)
• RNNDS → RNNDS: from the RNN-based model (DeepSpeech2) to the RNN-based model (DeepSpeech2)
• (CNNDR & RNNDS) → CNNMini: from two teachers (Jasper DR & DeepSpeech2) to the CNN-based model (Jasper Mini)
• (CNNDR & RNNDS) → QuartzNet: from two teachers (Jasper DR & DeepSpeech2) to the CNN-based model (QuartzNet)
V-A1 CNNDR → RNNDS
KD | Model | dev-clean (WER %, w/o LM) | dev-other (w/o LM) | test-clean (w/o LM) | test-other (w/o LM) | dev-clean (WER %, w/ LM) | dev-other (w/ LM) | test-clean (w/ LM) | test-other (w/ LM)
---|---|---|---|---|---|---|---|---|---
CNNDR | 3.61 | 11.37 | 3.77 | 11.08 | 3.04 | 9.52 | 3.69 | 9.38 | |
RNNDS | 7.64 | 22.02 | 7.70 | 22.60 | 4.88 | 16.00 | 5.18 | 16.55 | |
(1) CNNDR → RNNDS | + Sequence-level KD [24] | 7.69 | 22.17 | 8.01 | 22.91 | 4.91 | 15.93 | 5.33 | 16.69
+ Guided CTC training [27] | 7.32 | 22.16 | 7.63 | 22.66 | 5.37 | 17.33 | 5.55 | 17.61 | |
+ Ours | 6.64 | 21.16 | 6.97 | 21.06 | 4.60 | 15.60 | 4.98 | 16.27 | |
CNNDR | 3.61 | 11.37 | 3.77 | 11.08 | 3.04 | 9.52 | 3.69 | 9.38 | |
CNNMini | 8.66 | 23.28 | 8.85 | 24.26 | 4.83 | 15.53 | 5.24 | 16.40 | |
(2) CNNDR → CNNMini | + Sequence-level KD [24] | 8.96 | 23.73 | 9.10 | 24.81 | 5.16 | 15.54 | 5.48 | 16.91
+ Guided CTC training [27] | 7.81 | 21.93 | 8.29 | 22.49 | 5.17 | 15.94 | 5.58 | 16.85 | |
+ Ours | 6.12 | 18.02 | 6.23 | 18.69 | 4.41 | 13.89 | 4.72 | 14.23 | |
RNNDS1 | 6.64 | 21.16 | 6.97 | 21.06 | 4.60 | 15.60 | 4.98 | 16.27 | |
CNNMini | 8.66 | 23.28 | 8.85 | 24.26 | 4.83 | 15.53 | 5.24 | 16.40 | |
(3)-(a) RNNDS1 → CNNMini | + Sequence-level KD [24] | 8.17 | 22.43 | 8.46 | 23.34 | 5.17 | 15.60 | 5.53 | 16.52
+ Guided CTC training [27] | 8.01 | 21.94 | 8.11 | 22.53 | 5.25 | 15.91 | 5.63 | 16.47 | |
+ Ours | 6.26 | 18.01 | 6.33 | 18.37 | 4.50 | 13.78 | 4.75 | 14.00 | |
RNNDS2 | 7.21 | 21.78 | 7.47 | 22.15 | 4.83 | 15.59 | 5.08 | 16.52 | |
CNNMini | 8.66 | 23.28 | 8.85 | 24.26 | 4.83 | 15.53 | 5.24 | 16.40 | |
(3)-(b) RNNDS2 → CNNMini | + Sequence-level KD [24] | 7.49 | 20.81 | 7.55 | 21.84 | 5.08 | 15.30 | 5.22 | 16.08
+ Guided CTC training [27] | 7.74 | 21.29 | 7.85 | 21.88 | 5.26 | 15.61 | 5.69 | 16.24 | |
+ Ours | 6.25 | 18.10 | 6.27 | 18.82 | 4.40 | 13.88 | 4.83 | 14.44 | |
RNNDS1 | 6.64 | 21.16 | 6.97 | 21.06 | 4.60 | 15.60 | 4.98 | 16.27 | |
RNNDS | 7.64 | 22.02 | 7.70 | 22.60 | 4.88 | 16.00 | 5.18 | 16.55 | |
(4)-(a) RNNDS1 → RNNDS | + Sequence-level KD [24] | 7.41 | 21.51 | 7.58 | 22.23 | 5.02 | 16.29 | 5.23 | 17.16
+ Guided CTC training [27] | 7.34 | 21.84 | 7.54 | 22.33 | 5.22 | 16.64 | 5.49 | 16.98 | |
+ Ours | 6.51 | 20.28 | 6.77 | 20.58 | 4.63 | 15.37 | 4.95 | 15.66 | |
RNNDS2 | 7.21 | 21.78 | 7.47 | 22.15 | 4.83 | 15.59 | 5.08 | 16.52 | |
RNNDS | 7.64 | 22.02 | 7.70 | 22.60 | 4.88 | 16.00 | 5.18 | 16.55 | |
(4)-(b) RNNDS2 → RNNDS | + Sequence-level KD [24] | 7.39 | 21.76 | 7.56 | 22.26 | 5.05 | 16.41 | 5.30 | 17.33
+ Guided CTC training [27] | 7.39 | 21.83 | 7.49 | 22.24 | 5.17 | 16.80 | 5.38 | 17.37 | |
+ Ours | 6.52 | 19.98 | 6.80 | 20.93 | 4.54 | 15.14 | 4.90 | 16.21 |
TutorNet | WER (%) | RERR (%)
---|---|---
SKD | 7.21 | 5.63
SKD+RKD w/o frame weighting | 6.74 | 11.78
SKD+RKD w/ frame weighting | 6.64 | 13.09
We first experimented with TutorNet in the CNNDR → RNNDS scenario. Table II-(1) summarizes the WER results on LibriSpeech, comparing the performance of the proposed method with the previous KD approaches. The results show that most of the conventional KD methods did not always perform better than the student baseline, and in some cases their performance was even worse. In contrast, TutorNet always improved the performance of the original student. As shown in Table III, each stage of TutorNet was found to be useful for training the student. We observed that RNNDS can be effectively trained using the hidden representation of CNNDR via TutorNet. Also, RKD with frame weighting contributed to improving the WER performance compared with the unweighted RKD. Our best performance was achieved when using both RKD and SKD, which yielded WER 6.64 % (RERR 13.09 %) with greedy decoding and WER 4.60 % (RERR 5.74 %) with beam-search decoding on the dev-clean dataset.
TutorNet | WER (%) | RERR (%)
---|---|---
SKD | 7.64 | 11.78
SKD+RKD w/o frame weighting | 6.40 | 26.10
SKD+RKD w/ frame weighting | 6.12 | 29.33
Teacher model | TutorNet | WER (%) | RERR (%)
---|---|---|---
RNNDS1 | SKD | 7.32 | 15.47
RNNDS1 | SKD+RKD w/o frame weighting | 6.40 | 26.10
RNNDS1 | SKD+RKD w/ frame weighting | 6.26 | 27.71
RNNDS2 | SKD | 6.65 | 23.21
RNNDS2 | SKD+RKD w/o frame weighting | 6.32 | 27.02
RNNDS2 | SKD+RKD w/ frame weighting | 6.25 | 27.83
V-A2 CNNDR → CNNMini
We also conducted experiments on KD between CNNDR and CNNMini. The main structural difference between the two models is that CNNMini consists of depthwise separable convolutions with no dense residual connection. The results in Table II-(2) show that TutorNet achieved significant improvements even when the teacher and student models had different convolution types. In the case of greedy decoding, compared to the previous results in Table II-(1), the guided CTC training [27] achieved a considerable improvement over the student baseline with WER 7.81 % on dev-clean. Still, the proposed method showed the best performance in all configurations, with WER 6.12 % on the dev-clean dataset. When applying beam-search decoding with the LM, TutorNet achieved WER 4.41 % on dev-clean, even though most of the conventional methods did not show an improvement over the baseline in CNNDR → CNNMini. From Table IV, we can also confirm that SKD and RKD significantly improved the WER performance of the original student in all cases, and the best performance was obtained when both were applied.
V-A3 RNNDS → CNNMini
In addition to the previous experiments, we proceeded to verify whether CNNMini can benefit from the guidance of RNNDS. Since our distilled RNNDS in Section V-A-1 (Table II-(1)) performed better than the original DeepSpeech2 in the OpenSeq2Seq toolkit, we could employ it as the teacher in this scenario. In this experiment, we applied two different RNNDS models as the teachers, both of which had previously been distilled in the CNNDR → RNNDS scenario. Therefore, RNNDS → CNNMini can be rewritten as (CNNDR → RNNDS) → CNNMini. To distinguish the two RNN-based teachers, we number them RNNDS1 and RNNDS2.
[Fig. 5]
First, we applied RNNDS1 (WER 6.64 % with greedy decoding) as the teacher. WERs on the LibriSpeech corpus are shown in Table II-(3)-(a). From the results, we observed that TutorNet helped CNNMini benefit from the RNN-based teacher while achieving better performance than the other conventional methods. Interestingly, when both SKD and RKD were applied, the distilled CNNMini performed better than its teacher in all cases, including greedy decoding and beam-search decoding. On the dev-clean dataset with greedy decoding, TutorNet achieved WER 6.26 %, although that of RNNDS1 was 6.64 %. In the case of beam-search decoding, TutorNet also outperformed the teacher, achieving WER 4.50 % on the dev-clean dataset. From Table V, we can discover some interesting results of SKD training. When applying SKD, RNNDS1 → CNNMini provided WER 7.32 % with greedy decoding. On the other hand, as presented in Table IV, CNNDR → CNNMini showed WER 7.64 %. In terms of WER, although RNNDS1 (WER 6.64 % with greedy decoding) was 3.03 %p (we use "%p" to denote a percentage point, i.e., the arithmetic difference between two percentages) worse than CNNDR (WER 3.61 % with greedy decoding), the performance of the distilled student was 0.32 %p better. Considering that the CNNDR teacher required high computational resources (400 epochs with 8 Tesla V100 GPUs, each with 32GB of memory) to achieve WER 3.61 %, RNNDS (50 epochs with 3 Titan V GPUs, each with 16GB of memory) could be an efficient alternative for transferring softmax-level knowledge.
From these experimental results regarding SKD, we verified that a teacher with better WER performance does not necessarily help the student model's training more; in some particular cases, a poorer teacher can be more supportive. To further check the effect of the teacher's performance on SKD training, we adopted RNNDS2 (WER 7.21 % with greedy decoding on dev-clean) as another RNN-based teacher model, whose knowledge had been transferred from CNNDR in Section V-A-1 (Table III). Table II-(3)-(b) gives the WER results on the LibriSpeech corpus. Compared to the competing KD methods, TutorNet showed better WER improvements in all cases. Also, TutorNet allowed CNNMini to outperform its teacher when applying both SKD and RKD, providing WER 6.25 % with greedy decoding and WER 4.40 % with beam-search decoding. As given in Table V, RNNDS2 → CNNMini achieved WER 6.65 % with SKD training, whereas RNNDS1 → CNNMini provided WER 7.32 %. This means that the softmax prediction of RNNDS2 (WER 7.21 % with greedy decoding) was more informative than that of RNNDS1 (WER 6.64 % with greedy decoding). When we applied RKD and SKD altogether, the performance of RNNDS2 → CNNMini was slightly better than that of RNNDS1 → CNNMini, but the difference was negligible.
Next, we compared transferring from different teachers to the same CNNMini. Fig. 5 displays the variation of the SKD training loss over time for three different cases: (1) CNNDR → CNNMini, (2) RNNDS1 → CNNMini, and (3) RNNDS2 → CNNMini. With SKD training, RNNDS2 → CNNMini showed faster convergence than the others, which again indicates that the teacher model with the best WER performance did not necessarily help the student model's training the most. In other words, RNNDS2 can be more supportive in distilling knowledge, notwithstanding its smaller parameter size (13.1 M parameters) and worse performance (WER 7.21 % with greedy decoding) compared to CNNDR (332.6 M parameters / WER 3.61 % with greedy decoding).
Student model | SKD teacher | RKD teacher | LM | dev-clean (WER %) | dev-other (WER %) | test-clean (WER %) | test-other (WER %) | dev-clean (RERR %) | dev-other (RERR %) | test-clean (RERR %) | test-other (RERR %)
---|---|---|---|---|---|---|---|---|---|---|---
CNNMini | - | - | - | 8.66 | 23.28 | 8.85 | 24.26 | - | - | - | -
CNNMini | RNNDS2 | RNNDS2 | - | 6.25 | 18.10 | 6.27 | 18.82 | 27.83 | 22.25 | 29.15 | 22.42
CNNMini | CNNDR | CNNDR | - | 6.12 | 18.02 | 6.23 | 18.69 | 29.33 | 22.59 | 29.60 | 22.96
CNNMini | RNNDS2 | CNNDR | - | 6.03 | 18.02 | 6.20 | 18.52 | 30.37 | 22.59 | 29.94 | 23.66
CNNMini | - | - | 4-gram | 4.83 | 15.53 | 5.24 | 16.40 | - | - | - | -
CNNMini | RNNDS2 | RNNDS2 | 4-gram | 4.40 | 13.88 | 4.83 | 14.44 | 8.90 | 10.62 | 7.82 | 11.95
CNNMini | CNNDR | CNNDR | 4-gram | 4.41 | 13.89 | 4.72 | 14.23 | 8.70 | 10.56 | 9.92 | 13.23
CNNMini | RNNDS2 | CNNDR | 4-gram | 4.40 | 13.78 | 4.62 | 14.32 | 8.90 | 11.27 | 11.83 | 12.68
Student model | SKD teacher | RKD teacher | LM | Augmentation | dev-clean (WER %) | dev-other (WER %) | test-clean (WER %) | test-other (WER %) | dev-clean (RERR %) | dev-other (RERR %) | test-clean (RERR %) | test-other (RERR %)
---|---|---|---|---|---|---|---|---|---|---|---|---
QuartzNet | - | - | - | - | 5.66 | 16.57 | 5.85 | 17.18 | - | - | - | -
QuartzNet | RNNDS2 | CNNDR | - | - | 4.95 | 15.08 | 4.95 | 15.46 | 12.54 | 8.99 | 15.38 | 10.01
QuartzNet | - | - | - | SpecAugment [45] | 5.33 | 15.32 | 5.33 | 15.48 | - | - | - | -
QuartzNet | RNNDS2 | CNNDR | - | SpecAugment [45] | 4.68 | 13.92 | 4.68 | 14.06 | 12.20 | 9.14 | 12.20 | 9.17
QuartzNet | - | - | 4-gram | - | 4.03 | 12.36 | 4.66 | 13.25 | - | - | - | -
QuartzNet | RNNDS2 | CNNDR | 4-gram | - | 3.95 | 12.25 | 4.32 | 12.56 | 1.99 | 0.89 | 7.30 | 5.21
QuartzNet | - | - | 4-gram | SpecAugment [45] | 4.01 | 11.44 | 4.16 | 11.50 | - | - | - | -
QuartzNet | RNNDS2 | CNNDR | 4-gram | SpecAugment [45] | 3.54 | 10.78 | 3.86 | 11.14 | 11.72 | 5.77 | 7.21 | 3.13
V-A4 RNNDS → RNNDS
[Fig. 6]
In the previous experiments covered in Sections V-A-1, V-A-2, and V-A-3, we mainly paid attention to how well TutorNet can distill knowledge between networks with different topologies. On top of that, we tried to verify that TutorNet still works well in the RNNDS → RNNDS transfer, which is a typical KD case. To maintain the same model configuration, we reused RNNDS1 (WER 6.64 % with greedy decoding) and RNNDS2 (WER 7.21 % with greedy decoding) as the teachers. Since both RNNDS1 and RNNDS2 had previously been transferred from CNNDR in Section V-A, RNNDS → RNNDS can be rewritten as (CNNDR → RNNDS) → RNNDS.
From Table II-(4)-(a), we can confirm that TutorNet gave better WER improvements than the other conventional approaches. This means that TutorNet also applies well to the normal KD case, where the teacher and student models share the same architecture. Our best results were achieved when training the student with both RKD and SKD. Compared to the previous results in Table II-(1), RNNDS1 → RNNDS performed better than CNNDR → RNNDS, except on the dev-clean dataset with beam-search decoding. This means that RNNDS1 could be considered more supportive than CNNDR in transferring knowledge to RNNDS. When we applied both RKD and SKD, RNNDS1 → RNNDS provided WER 6.51 % on the dev-clean dataset with greedy decoding, which was 0.13 %p better than that of CNNDR → RNNDS.
Table II-(4)-(b) shows the WER results obtained from the RNNDS2 → RNNDS scenario. In the case of beam-search decoding, RNNDS2 → RNNDS performed better than RNNDS1 → RNNDS, except on the test-clean dataset. The results show that most of the conventional KD methods did not always perform better than the student baseline, and in some cases their performance was even worse. In contrast, TutorNet always improved the performance of the original student in both RNNDS1 → RNNDS and RNNDS2 → RNNDS.
In addition, we compared distilling from different teachers to the same RNNDS. In Fig. 6, we plot the change in $\mathcal{L}_{\mathrm{SKD}}$ over time for three different cases: (1) CNNDR → RNNDS, (2) RNNDS1 → RNNDS, and (3) RNNDS2 → RNNDS. In the case of SKD training, the RNNDS → RNNDS scenarios showed faster convergence than CNNDR → RNNDS. Considering the high performance of CNNDR (WER 3.61 % with greedy decoding), it is interesting that RNNDS1 → RNNDS and RNNDS2 → RNNDS converged faster than CNNDR → RNNDS. Unlike Fig. 5, there was no significant difference between RNNDS1 → RNNDS and RNNDS2 → RNNDS.
V-A5 (CNNDR & RNNDS) → CNNMini
KD | Model | dev-iOS (CER %) | dev-Android (CER %) | dev-Mic (CER %) | test-iOS (CER %) | test-Android (CER %) | test-Mic (CER %)
---|---|---|---|---|---|---|---
CNNDR → RNNDS | CNNDR | 9.69 | 11.48 | 12.23 | 9.37 | 10.84 | 11.84
CNNDR → RNNDS | RNNDS | 13.40 | 16.66 | 17.93 | 12.51 | 14.08 | 16.95
CNNDR → RNNDS | + Guided CTC training [27] | 12.74 | 14.98 | 15.55 | 12.36 | 13.91 | 14.80
CNNDR → RNNDS | + Ours | 12.30 | 14.22 | 14.77 | 12.07 | 13.52 | 14.35
CNNDR → CNNMini | CNNDR | 9.69 | 11.48 | 12.23 | 9.37 | 10.84 | 11.84
CNNDR → CNNMini | CNNMini | 11.77 | 14.23 | 15.03 | 11.38 | 12.71 | 14.27
CNNDR → CNNMini | + Guided CTC training [27] | 11.34 | 13.33 | 14.49 | 10.87 | 12.40 | 13.88
CNNDR → CNNMini | + Ours | 10.65 | 12.74 | 13.99 | 10.08 | 11.41 | 13.38
The results of the previous experiments, especially in Sections V-A-2 and V-A-3, suggest that the teacher with the best WER performance did not necessarily help the training of the student the most, and that the best-performing teacher was different for each level of knowledge. For instance, even though RNNDS2 achieved worse results than the other teachers, it was shown to be more supportive in transferring the softmax-level knowledge. Meanwhile, CNNDR was more effective in distilling the representation-level knowledge to CNNMini. These observations lead us to an interesting question: can we get further improvements if we select separate teachers for RKD and SKD?
To answer this question, we conducted additional experiments applying CNNDR as the RKD teacher and RNNDS2 as the SKD teacher, with CNNMini adopted as the target of distillation. The WER results are described in Table VI. CNNMini distilled from a suitable teacher for each stage showed better performance. The performance difference between CNNDR → CNNMini and (CNNDR & RNNDS) → CNNMini may look negligible, but (CNNDR & RNNDS) → CNNMini showed better WER improvements in all cases with greedy decoding. Also, when applying beam-search decoding, (CNNDR & RNNDS) → CNNMini performed better than the others in most cases, including the dev-clean, dev-other, and test-clean datasets. From these results, we believe that the proposed method enables much more flexible model selection in KD: we can select suitable teacher models for each stage regardless of the model type via TutorNet. Such flexibility in KD offers more possibilities for better performance.
V-A6 (CNNDR & RNNDS) → QuartzNet
In Section V-A-5, we adopted CNNDR as the RKD teacher and RNNDS2 as the SKD teacher, which gave better performance in our experiments. Based on these results, we tried to apply the same teacher selection (RKD teacher: CNNDR, SKD teacher: RNNDS2) to QuartzNet, since QuartzNet achieves near state-of-the-art accuracy among CTC models while having fewer parameters. Also, in order to further improve the performance, we applied SpecAugment [45] as an augmentation technique. Table VII gives the WER results on LibriSpeech. With greedy decoding, TutorNet (w/o SpecAugment) showed WER 4.95 % and RERR 12.54 % on the dev-clean dataset. Our best greedy-decoding performance was achieved when applying TutorNet with SpecAugment, which yielded WER 4.68 % and RERR 12.20 %. In the case of beam-search decoding, (CNNDR & RNNDS) → QuartzNet (w/ SpecAugment) provided WER 3.54 % on dev-clean, which was the best performance in our experiments.
V-B AISHELL-2
In order to show the versatility of the proposed method, we also conducted experiments on AISHELL-2 in the following two scenarios:
• CNNDR → RNNDS: from the CNN-based model (Jasper DR) to the RNN-based model (DeepSpeech2)
• CNNDR → CNNMini: from the CNN-based model (Jasper DR) to the CNN-based model (Jasper Mini)
We experimented with TutorNet in both the CNNDR → RNNDS and CNNDR → CNNMini scenarios on the AISHELL-2 dataset. For the teacher model, we used the pre-trained Jasper DR (CER 9.69 % on the iOS dev set, cf. Table VIII) provided by the NeMo toolkit. When training the CNN-based and RNN-based student models, we utilized the OpenSeq2Seq toolkit. Table VIII summarizes the results on AISHELL-2. The results show that TutorNet still works well with the Mandarin dataset. When distilling knowledge to RNNDS, the student achieved CER 12.30 % on the iOS dev set. In the case of CNNDR → CNNMini, TutorNet achieved CER 10.65 % and RERR 9.52 % on the iOS dev set.
V-C The Applicability of TutorNet to the Other End-to-End Speech Recognition Models
In the previous experiments, we mainly focused on KD between CTC-based models, especially across different types of neural networks. For speech recognition, besides the CTC-based model, there are various other types of end-to-end models, such as the Transformer and AED models. If we can transfer knowledge regardless of the type of model, even more flexible KD becomes possible. To check the applicability of the proposed method to these other types of models, we conducted two additional scenarios with the LibriSpeech dataset:
• Transformer → CTC-based model: from Transformer to the CTC-based model (DeepSpeech2)
• Transformer → AED model: from Transformer to the AED model (VGG-BLSTM)
V-C1 Transformer → CTC-based model
[Fig. 7]
Model | TutorNet | dev-clean (WER %) | dev-other (WER %) | test-clean (WER %) | test-other (WER %)
---|---|---|---|---|---
Transformer | - | 3.21 | 8.58 | 3.45 | 8.45
CTC-based model | - | DNC | DNC | DNC | DNC
CTC-based model | RKD for 100 steps | 8.82 | 22.72 | 8.84 | 23.55
CTC-based model | RKD | 8.26 | 22.05 | 8.20 | 22.73
CTC-based model | RKD+SKD | 7.86 | 21.95 | 8.11 | 22.62
Model | TutorNet | LM | dev-clean (WER %) | dev-other (WER %) | test-clean (WER %) | test-other (WER %) | dev-clean (RERR %) | dev-other (RERR %) | test-clean (RERR %) | test-other (RERR %)
---|---|---|---|---|---|---|---|---|---|---
Transformer | - | - | 3.21 | 8.58 | 3.45 | 8.45 | - | - | - | -
AED | - | - | 5.80 | 16.80 | 5.75 | 17.50 | - | - | - | -
AED | RKD | - | 4.89 | 15.15 | 4.91 | 15.65 | 15.69 | 9.82 | 14.61 | 10.57
AED | RKD+SKD | - | 4.66 | 15.00 | 4.87 | 15.35 | 19.67 | 10.71 | 15.30 | 12.29
Transformer | - | RNNLM | 2.82 | 7.21 | 3.07 | 7.17 | - | - | - | -
AED | - | RNNLM | 4.00 | 12.49 | 4.25 | 13.04 | - | - | - | -
AED | RKD | RNNLM | 3.85 | 11.66 | 3.86 | 12.38 | 3.75 | 6.65 | 9.18 | 5.06
AED | RKD+SKD | RNNLM | 3.57 | 11.16 | 3.68 | 11.80 | 10.75 | 10.65 | 13.41 | 9.51
Firstly, we conducted experiments on KD from the Transformer model to the CTC-based model (RNNDS). In the case of Transformer, we used the pre-trained model provided by the ESPnet toolkit. When training the CTC-based student, we utilized the OpenSeq2Seq toolkit. Since Transformer commonly uses BPE tokens as the output units, the CTC-based model was trained with the same BPE units as the Transformer (about 5000 tokens). The detailed procedure for Transformer → CTC-based model is illustrated in Fig. 7. For RKD, we selected the last layer of the Transformer encoder to transfer the hidden representation. After the RKD initialization, via SKD, the CTC-based model was trained with the softmax values provided by the Transformer. However, there was a convergence problem in training the CTC-based baseline with BPE units: when the CTC-based model was trained with a single BPE CTC loss, from random initialization and without any pre-training, it failed to converge, as in [46, 47]. Therefore, we adopted the CTC model initialized with RKD over the first 100 steps as the student baseline for the performance comparison. Table IX summarizes the WER results on LibriSpeech. From the experimental results, we verified that the proposed method not only helps the convergence of the BPE-level CTC model but can also connect Transformer and CTC models in the KD task. When applying both RKD and SKD, the BPE-level CTC model achieved WER 7.86 % on dev-clean.
V-C2 Transformer → AED model
In addition to the previous experiments on Transformer → CTC-based model, we proceeded to verify whether the AED model can benefit from the guidance of the Transformer model via TutorNet. We employed a pre-trained Transformer model, provided by the ESPnet toolkit, as the teacher model. We also trained the AED student model with the ESPnet toolkit. The detailed procedure of the proposed method is shown in Fig. 7. For RKD, we selected the encoder's last layer to transfer the hidden representation for both the teacher and student models. When training with SKD, considering that both models are based on hybrid CTC/attention, we applied the softmax values on the CTC side. Table X reports the WER and RERR performance of each stage. From the results, it is confirmed that TutorNet showed significant improvements for Transformer → AED model. Each stage of TutorNet was found to be useful for training the AED student model, which means that TutorNet is also well suited to the Transformer → AED case. On the dev-clean dataset, the student with RKD achieved WER 4.89 % with greedy decoding and WER 3.85 % with RNNLM decoding, which means that the AED model can be effectively trained using the hidden representation of the Transformer via TutorNet. Our best AED performance was achieved when using both RKD and SKD, which provided WER 4.66 % with greedy decoding and WER 3.57 % with RNNLM decoding. It is interesting that SKD not only matters for the CTC loss but also helps to improve the attention loss in the case of the hybrid CTC/attention model.
VI Conclusion
In this paper, we proposed a new KD method, TutorNet. This framework addresses a shortcoming of conventional KD, namely that the flexibility of model selection is limited because the student model structure should be similar to that of the given teacher. Through a number of experiments on LibriSpeech and AISHELL-2, we confirmed that TutorNet significantly contributes to improving the WER performance of the distilled student in a previously unexplored setting where the architecture of the student is inherently different from that of the teacher. In some particular configurations, it even allows the student to outperform its teacher. Furthermore, selecting a suitable teacher model for each training stage is possible via TutorNet, implying that we have more flexibility in selecting teacher or student models. For speech recognition, there are various types of end-to-end models, and each model has its own advantages. As various end-to-end speech recognition models can be flexibly selected, TutorNet represents a significant step toward flexible KD for the speech recognition task. We expect the application of TutorNet not to be restricted to the modality or model architecture of the tasks considered here, and we would like to examine its utility via cross-domain studies in the future.
References
- [1] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proc. ICML, pages 369–376, 2006.
- [2] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio. Attention-based models for speech recognition. In Proc. NIPS, pages 577–585, 2015.
- [3] A. Graves. Sequence transduction with recurrent neural networks. In Proc. ICML Workshop on Representation Learning, 2012.
- [4] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen, et al. Deep speech 2: end-to-end speech recognition in english and mandarin. In Proc. ICML, pages 173–182, 2016.
- [5] R. Collobert, C. Puhrsch, and G. Synnaeve. Wav2letter: an end-to-end convnet-based speech recognition system. arXiv preprint arXiv:1609.03193, 2016.
- [6] J. Li, V. Lavrukhin, B. Ginsburg, R. Leary, O. Kuchaiev, J. M. Cohen, H. Nguyen, and R. T. Gadde. Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288, 2019.
- [7] S. Kriman, K. Beliaev, B. Ginsburg, J. Huang, O. Kuchaiev, V. Lavrukhin, R. Leary, J. Li, and Y. Zhang. Quartznet: deep automatic speech recognition with 1d time-channel separable convolutions. arXiv preprint arXiv:1910.10261, 2019.
- [8] A. Vaswani et al. Attention is all you need. In Proc. NIPS, pages 5998––6008, 2017.
- [9] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In Proc. NIPS Workshop Deep Learn., 2014.
- [10] J. Ba and R. Caruana. Do deep nets really need to be deep? In Proc. NIPS, pages 2654––2662, 2014.
- [11] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. Fitnets: hints for thin deep nets. In Proc. ICLR, 2015.
- [12] S. Zagoruyko and N. Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In Proc. ICLR, 2017.
- [13] J. Yim, D. Joo, J. Bae, and J. Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In Proc. CVPR, 2017.
- [14] S. Srinivas and F. Fleuret. Knowledge transfer with jacobian matching. In Proc. ICML, 2018.
- [15] J. Li, R. Zhao, T. J. Huang, and Y. Gong. Learning small-size dnn with output-distribution-based criteria. In Proc. INTERSPEECH, 2014.
- [16] Y. Chebotar and A. Waters. Distilling knowledge from ensembles of neural networks for speech recognition. In Proc. INTERSPEECH, pages 3439–3443, 2016.
- [17] S. Watanabe, T. Hori, J. L. Roux, and J. R. Hershey. Student-teacher network learning with enhanced features. In Proc. ICASSP, pages 5275–5279, 2017.
- [18] L. Lu, M. Guo, and S. Renals. Knowledge distillation for small-footprint highway networks. In Proc. ICASSP, pages 4820–4824, 2017.
- [19] T. Fukuda, M. Suzuki, G. Kurata, S. Thomas, J. Cui, and B. Ramabhadran. Efficient knowledge distillation from an ensemble of teachers. In Proc. INTERSPEECH, pages 3697–3701, 2017.
- [20] K. J. Geras et al. Blending lstms into cnns. In Proc. ICLR Workshop, 2016.
- [21] J. H. M. Wong and M. J. F. Gales. Sequence student-teacher training of deep neural networks. In Proc. INTERSPEECH, pages 2761––2765, 2016.
- [22] J. H. M. Wong, M. J. F. Gales, and Y. Wang. General sequence teacher–student learning. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(11):1725––1736, 2019.
- [23] A. Senior, H. Sak, F. C. Quitry, T. Sainath, K. Rao, et al. Acoustic modelling with cd-ctc-smbr lstm rnns. In Proc. ASRU, pages 604–609, 2015.
- [24] R. Takashima, S. Li, and H. Kawai. An investigation of a knowledge distillation method for ctc acoustic models. In Proc. ICASSP, pages 5809–5813, 2018.
- [25] R. Takashima, S. Li, and H. Kawai. Investigation of sequence-level knowledge distillation methods for ctc acoustic models. In Proc. ICASSP, pages 6156–6160, 2019.
- [26] G. Kurata and K. Audhkhasi. Improved knowledge distillation from bi-directional to uni-directional lstm ctc for end-to-end speech recognition. In Proc. SLT, pages 411–417, 2018.
- [27] G. Kurata and K. Audhkhasi. Guiding ctc posterior spike timings for improved posterior fusion and knowledge distillation. In Proc. INTERSPEECH, pages 1616–1620, 2019.
- [28] H. Sak, A. Senior, K. Rao, O. Irsoy, A. Graves, F. Beaufays, and J. Schalkwyk. Learning acoustic frame labeling for speech recognition with recurrent neural networks. In Proc. ICASSP, pages 4280–4284. IEEE, 2015.
- [29] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur. Librispeech: an asr corpus based on public domain audio books. In Proc. ICASSP, pages 5206–5210, 2015.
- [30] J. Du, X. Nai, X. Liu, and H. Bu. Aishell-2: transforming mandarin asr research into industrial scale. arXiv preprint arXiv:1808.10583, 2018.
- [31] S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi. Hybrid ctc/attention architecture for end-to-end speech recognition. IEEE Journal of Selected Topics in Signal Processing, 11(8):1240––1253, 2017.
- [32] Y. Zhang, W. Chan, and N. Jaitly. Very deep convolutional networks for end-to-end speech recognition. In Proc. ICASSP, pages 4845––4849. IEEE, 2017.
- [33] T. Hori, S. Watanabe, Y. Zhang, and W. Chan. Advances in joint ctc-attention based end-to-end speech recognition with a deep cnn encoder and rnn-lm. In Proc. INTERSPEECH, pages 949––953, 2017.
- [34] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- [35] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals. Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In Proc. ICASSP, pages 4960––4964. IEEE, 2016.
- [36] O. Kuchaiev, B. Ginsburg, I. Gitman, V. Lavrukhin, J. Li, H. Nguyen, C. Case, and P. Micikevicius. Mixed-precision training for nlp and speech recognition with openseq2seq. arXiv preprint arXiv:1805.10387, 2018.
- [37] S. Watanabe et al. Espnet: end-to-end speech processing toolkit. In Proc. INTERSPEECH, pages 2207––2211, 2018.
- [38] O. Kuchaiev et al. Nemo: a toolkit for building ai applications using neural modules. arXiv preprint arXiv:1909.09577, 2019.
- [39] D. P. Kingma and J. Ba. Adam: a method for stochastic optimization. In Proc. ICLR, 2015.
- [40] B. Ginsburg, P. Castonguay, O. Hrinchuk, O. Kuchaiev, V. Lavrukhin, R. Leary, J. Li, H. Nguyen, and J. M. Cohen. Stochastic gradient methods with layer-wise adaptive moments for training of deep networks. arXiv preprint arXiv:1905.11286, 2019.
- [41] K. Heafield. Kenlm: faster and smaller language model queries. In Proc. EMNLP, 2011.
- [42] R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare words with subword units. In Proc. ACL, pages 1715––1725, 2016.
- [43] M. D. Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701v1, 2012.
- [44] A. Graves and N. Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In Proc. ICML, pages 1764–1772, 2014.
- [45] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le. Specaugment: a simple data augmentation method for automatic speech recognition. In Proc. INTERSPEECH, pages 2613–2617, 2019.
- [46] A. Garg, O. Gowda, A. Kumar, K. Kim, M. Kumar, and C. Kim. Improved multi-stage training of online attention-based encoder-decoder models. In Proc. ASRU, pages 70–77, 2019.
- [47] K. Audhkhasi, B. Ramabhadran, G. Saon, M. Picheny, and D. Nahamoo. Direct acoustics-to-word models for english conversational speech recognition. In Proc. INTERSPEECH, pages 959–963, 2017.