
Improved Mask-CTC for Non-Autoregressive End-to-End ASR

Abstract

For real-world deployment of automatic speech recognition (ASR), the system must offer fast inference while requiring only modest computational resources. The recently proposed end-to-end ASR system based on mask-predict with connectionist temporal classification (CTC), Mask-CTC, fulfills this demand by generating tokens in a non-autoregressive fashion. While Mask-CTC achieves remarkably fast inference, its recognition performance falls behind that of conventional autoregressive (AR) systems. To boost the performance of Mask-CTC, we first propose to enhance the encoder network by employing a recently proposed architecture called Conformer. Next, we propose new training and decoding methods by introducing an auxiliary objective to predict the length of a partial target sequence, which allows the model to delete or insert tokens during inference. Experimental results on different ASR tasks show that the proposed approaches improve Mask-CTC significantly, outperforming a standard CTC model (15.5% $\rightarrow$ 9.1% WER on WSJ). Moreover, Mask-CTC now achieves results competitive with AR models at no cost in inference speed ($<$ 0.1 RTF using a CPU). We also show a potential application of Mask-CTC to end-to-end speech translation.

Index Terms—  Non-autoregressive sequence generation, connectionist temporal classification, end-to-end speech recognition, end-to-end speech translation

1 Introduction

End-to-end automatic speech recognition (E2E-ASR) directly converts speech into text based on sequence-to-sequence modeling [1, 2, 3, 4]. Compared with the traditional hybrid system built upon separately trained modules [5], E2E-ASR greatly simplifies model training and inference, reducing both the cost of model development and the required computational resources. Moreover, E2E-ASR has achieved results comparable to those of hybrid systems on diverse tasks [6, 7, 8, 9]. Recent model architectures and techniques have further enhanced the performance of E2E-ASR [10, 11, 12, 13, 14, 15].

Many previous studies on E2E-ASR focus on autoregressive (AR) models [3, 4], which estimate the likelihood of a generated sequence based on a left-to-right probabilistic chain rule. AR models often perform best among E2E-ASR architectures [6]. However, they suffer from slow inference, requiring $L$ incremental calculations of the model to generate $L$ tokens. While some neural models, including Transformer [16], permit efficient inference using a GPU, such inference becomes highly computation-intensive in production environments. For example, mobile devices are often constrained by limited on-device resources, where only a CPU is generally available for model inference (it may be argued that inference can be offloaded to a cloud server, but running the model entirely on-device is important from a privacy perspective [9]). If the inference algorithm is light and simple, ASR can be performed on-device with low power consumption and fast recognition speed. Building E2E-ASR models that infer quickly with limited computation is therefore critical for real-world deployment.

In contrast to the AR framework, a non-autoregressive (NAR) model generates a sequence within a constant number of inference steps [1, 17]. NAR models in neural machine translation have achieved performance competitive with that of AR models [18, 19, 20, 21, 22, 23, 24, 25]. Several approaches have been proposed to realize NAR models in E2E-ASR [26, 27, 28, 29, 30, 31]. Connectionist temporal classification (CTC) predicts a frame-wise latent alignment between input speech frames and output tokens, and generates a target sequence under a conditional independence assumption between the frame predictions [26]. However, such an assumption limits the performance of CTC compared with that of AR models [32]. Imputer [28] effectively models the contextual dependencies by iteratively predicting the latent alignments of CTC based on mask prediction [33, 22]. Despite achieving performance comparable to that of AR models, Imputer processes the frame-level sequence, consisting of hundreds of units, with self-attention layers [16], whose computation scales quadratically with the sequence length. Mask-CTC [29], on the other hand, generates a sequence shorter than Imputer's by refining a token-level CTC output with mask prediction, achieving fast inference at less than 0.1 real time factor (RTF) on a CPU.

In this work, we focus on improving Mask-CTC based E2E-ASR, which suits production use owing to its fast inference. Mask-CTC starts sequence generation from an output of CTC to avoid the cumbersome length prediction of a target sequence, which is required in the previous NAR framework [17]. However, Mask-CTC faces some problems arising from this dependence on the CTC output. Firstly, the performance of Mask-CTC is strongly bounded by that of CTC, since only minor errors in the CTC output, such as spelling mistakes, are subject to refinement. Secondly, the length of the target sequence is fixed to that of the CTC output throughout decoding, making it difficult to recover from deletion or insertion errors in the CTC output. To overcome the former, we adopt Conformer [15], which models speech better than Transformer, to improve the CTC performance. For the latter, we introduce new training and decoding strategies to handle deletion and insertion errors during inference. We also explore the potential of Mask-CTC for the end-to-end speech translation (E2E-ST) task.

2 Non-autoregressive End-to-End ASR

E2E-ASR aims to model the conditional probability $P(Y|X)$, which directly maps a $T$-length input sequence $X=(\mathbf{x}_{t}\in\mathbb{R}^{D}\,|\,t=1,\dots,T)$ into an $L$-length output sequence $Y=(y_{l}\in\mathcal{V}\,|\,l=1,\dots,L)$. Here, $\mathbf{x}_{t}$ is a $D$-dimensional acoustic feature at frame $t$, $y_{l}$ is an output token at position $l$, and $\mathcal{V}$ is a vocabulary.

In this section, we review non-autoregressive (NAR) models, which perform sequence generation within a constant number of inference steps, in parallel across token positions.

2.1 Connectionist temporal classification (CTC)

CTC predicts a frame-level input-output alignment $A=(a_{t}\in\mathcal{V}\cup\{\epsilon\}\,|\,t=1,\dots,T)$ by introducing a special blank symbol $\epsilon$ [1]. Based on a conditional independence assumption per frame, CTC models the probability $P(Y|X)$ by marginalizing over all possible paths as:

$P_{\mathrm{ctc}}(Y|X)=\sum_{A\in\beta^{-1}(Y)}P(A|X),$ (1)

where $\beta^{-1}(Y)$ denotes all possible alignments compatible with $Y$. The conditional independence assumption enables the model to perform NAR sequence generation in a single forward pass.
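For illustration, Eq. (1) is the quantity that standard CTC loss implementations compute. Below is a minimal PyTorch sketch; the tensor shapes, batch size, and the convention of reserving index 0 for the blank symbol $\epsilon$ are illustrative assumptions rather than details from this paper.

```python
import torch
import torch.nn as nn

# Minimal sketch of the CTC objective in Eq. (1) with PyTorch's built-in loss.
# Shapes, batch size, and the blank index (0) are illustrative assumptions.
T, B, V = 120, 4, 50                                   # frames, batch, vocab (blank = 0)
log_probs = torch.randn(T, B, V).log_softmax(dim=-1)   # frame-wise encoder outputs
targets = torch.randint(1, V, (B, 20))                 # token ids, excluding the blank
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 20, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
# The loss marginalizes over all alignments A in beta^{-1}(Y), i.e. -log P_ctc(Y|X).
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```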

2.2 Mask-CTC

Due to the conditional independence assumption, the CTC-based model generally suffers from poor recognition performance [6]. Mask-CTC has been proposed to mitigate this problem by iteratively refining an output of CTC with bi-directional contexts of tokens [29]. Mask-CTC adopts an encoder-decoder model built upon Transformer blocks [16]. CTC is applied to the encoder output [34, 12], and the decoder is trained with the masked language model (MLM) objective [33, 22]. For training the MLM decoder, randomly sampled tokens $Y_{\mathrm{mask}}$ are replaced by a special mask token <MASK>. $Y_{\mathrm{mask}}$ are then predicted conditioned on the remaining observed (unmasked) tokens $Y_{\mathrm{obs}}$ $(=Y\setminus Y_{\mathrm{mask}})$ as follows:

$P_{\mathrm{mlm}}(Y_{\mathrm{mask}}|Y_{\mathrm{obs}},X)=\prod_{y\in Y_{\mathrm{mask}}}P(y|Y_{\mathrm{obs}},X).$ (2)

The number of masked tokens $N_{\mathrm{mask}}$ $(=|Y_{\mathrm{mask}}|)$ is sampled from a uniform distribution between $1$ and $L$, as in [22, 23]. With the CTC objective in Eq. (1), Mask-CTC optimizes the model parameters by maximizing the following log-likelihood:

$\mathcal{L}_{\mathrm{NAR}}=\alpha\log P_{\mathrm{ctc}}(Y|X)+(1-\alpha)\log P_{\mathrm{mlm}}(Y_{\mathrm{mask}}|Y_{\mathrm{obs}},X),$ (3)

where $\alpha$ $(0\leq\alpha\leq 1)$ is a tunable parameter.
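To make the training procedure concrete, the sketch below shows one way to build the masked decoder input with $N_{\mathrm{mask}}$ drawn uniformly from 1 to $L$, as described above. The `MASK_ID` placeholder and the list-based representation are our own simplifying assumptions.

```python
import random

MASK_ID = -1  # placeholder id for <MASK>; an assumption, not the actual vocabulary id

def random_mask(y):
    """Sample N_mask ~ Uniform(1, L) positions of y and replace them with <MASK>.

    Returns the masked decoder input and the masked positions, which are the
    prediction targets of the MLM objective in Eq. (2).
    """
    n_mask = random.randint(1, len(y))
    mask_pos = set(random.sample(range(len(y)), n_mask))
    y_in = [MASK_ID if i in mask_pos else t for i, t in enumerate(y)]
    return y_in, sorted(mask_pos)

# Example for one training step: the decoder predicts y[l] at each masked l,
# conditioned on y_in and the speech X; this term is weighted against the CTC
# loss by alpha as in Eq. (3).
y_in, mask_pos = random_mask([7, 12, 3, 9, 25])
```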

During inference, an output of CTC is obtained through greedy decoding by suppressing repeated tokens and removing blank symbols. Then, low-confidence tokens are replaced with <MASK> by thresholding the CTC posterior probabilities with $P_{\mathrm{thres}}$. The partially masked sequence is then fed into the MLM decoder to predict the masked tokens. With this two-pass inference based on CTC and MLM decoding, errors in the CTC output caused by the conditional independence assumption are expected to be corrected, conditioned on the high-confidence tokens in the output sequence. The CTC output can be further improved by gradually filling in the masked tokens over multiple steps $K$. At the $n$-th step, $y_{l}\in Y_{\mathrm{mask}}^{(n)}$ is predicted as follows:

$y_{l}=\operatorname*{argmax}_{y}P_{\mathrm{mlm}}(y_{l}=y|Y_{\mathrm{obs}}^{(n)},X),$ (4)

where the top $C$ positions with the highest decoder probabilities are selected for prediction at each iteration. By setting $C=\lfloor N_{\mathrm{mask}}/K\rfloor$, inference can be completed in a constant number of steps $K$.
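Putting the two passes together, the following sketch outlines this inference procedure under some simplifying assumptions: `mlm_decoder` is an assumed callable that returns a per-position distribution over the vocabulary for a partially masked input, and the per-token CTC confidence is taken as the maximum frame probability of that token.

```python
def ctc_greedy(frame_ids, frame_probs, blank=0):
    """Collapse repeated frame predictions and remove blanks; keep a per-token
    confidence (here, the maximum frame probability) for later thresholding."""
    tokens, confs, prev = [], [], None
    for i, p in zip(frame_ids, frame_probs):
        if i != blank and i != prev:
            tokens.append(i)
            confs.append(p)
        elif i != blank:
            confs[-1] = max(confs[-1], p)
        prev = i
    return tokens, confs

def mask_ctc_decode(tokens, confs, mlm_decoder, p_thres=0.999, K=10, mask_id=-1):
    """Two-pass Mask-CTC inference: mask low-confidence CTC tokens, then fill
    about C = floor(N_mask / K) masks per iteration with the MLM decoder (Eq. (4))."""
    y = [t if c >= p_thres else mask_id for t, c in zip(tokens, confs)]
    masked = [l for l, t in enumerate(y) if t == mask_id]
    C = max(1, len(masked) // K)  # at least one mask filled per iteration
    while masked:
        probs = mlm_decoder(y)    # assumed: probs[l][v] = P(y_l = v | Y_obs, X)
        # fill the C masked positions the decoder is most confident about
        picked = sorted(masked, key=lambda l: -max(probs[l]))[:C]
        for l in picked:
            y[l] = max(range(len(probs[l])), key=lambda v: probs[l][v])
        masked = [l for l in masked if l not in picked]
    return y
```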

3 Proposed Improvements of Mask-CTC

Making use of the CTC output during inference, Mask-CTC effectively avoids the cumbersome length prediction of a target sequence [17], which is rather challenging in ASR [27, 29]. However, this dependence on the CTC output has some drawbacks. Firstly, the performance of CTC limits that of Mask-CTC because the MLM decoder can only make minor changes to the CTC output, such as correcting spelling errors. Secondly, as the target length is fixed to that of the CTC output throughout decoding, it is difficult to recover from deletion or insertion errors.

To tackle these problems, we propose to (1) enhance the encoder by adopting a state-of-the-art architecture, Conformer [15], and (2) introduce new training and decoding methods for the MLM decoder to handle insertion and deletion errors during inference.

3.1 Conformer

Conformer integrates a convolution module and a Macaron-Net style feed-forward network (FFN) [35] into the Transformer encoder block [16]. While the self-attention layer is effective at modeling long-range global context, the convolution layer adds the ability to capture local patterns in a sequence, which is effective for speech processing. We expect the Conformer encoder to boost the performance of CTC and thus improve the overall performance of Mask-CTC accordingly.
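For reference, the sketch below outlines one Conformer block in PyTorch, following the block structure described above (half-step Macaron FFNs around self-attention and a convolution module). The hidden sizes mirror the setup in Sec. 4.2, but the convolution kernel size is arbitrary and relative positional encoding is omitted, so this is a simplified assumption rather than the exact configuration used in this paper.

```python
import torch
import torch.nn as nn

class ConformerBlock(nn.Module):
    """One Conformer encoder block: half-step FFN, self-attention, convolution
    module, and another half-step FFN, each wrapped in a residual connection."""

    def __init__(self, d_model=256, n_heads=4, d_ff=1024, kernel_size=15, dropout=0.1):
        super().__init__()
        def ffn():
            return nn.Sequential(
                nn.LayerNorm(d_model),
                nn.Linear(d_model, d_ff), nn.SiLU(), nn.Dropout(dropout),
                nn.Linear(d_ff, d_model), nn.Dropout(dropout),
            )
        self.ffn1, self.ffn2 = ffn(), ffn()
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                          batch_first=True)
        # Convolution module: pointwise conv + GLU, depthwise conv, pointwise conv.
        self.conv_norm = nn.LayerNorm(d_model)
        self.conv = nn.Sequential(
            nn.Conv1d(d_model, 2 * d_model, 1), nn.GLU(dim=1),
            nn.Conv1d(d_model, d_model, kernel_size,
                      padding=kernel_size // 2, groups=d_model),
            nn.BatchNorm1d(d_model), nn.SiLU(),
            nn.Conv1d(d_model, d_model, 1), nn.Dropout(dropout),
        )
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x):                       # x: (batch, time, d_model)
        x = x + 0.5 * self.ffn1(x)              # Macaron-style half-step FFN
        h = self.attn_norm(x)
        x = x + self.attn(h, h, h)[0]           # global context via self-attention
        h = self.conv_norm(x).transpose(1, 2)   # (batch, d_model, time) for Conv1d
        x = x + self.conv(h).transpose(1, 2)    # local pattern modeling
        x = x + 0.5 * self.ffn2(x)
        return self.final_norm(x)
```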

3.2 Dynamic length prediction (DLP)

3.2.1 Training

Inspired by [36, 21], we propose dynamic length prediction (DLP), in which the MLM decoder is trained to dynamically predict, at each iteration, the length of the partial target sequence corresponding to a masked token. In addition to the mask prediction task in Eq. (2), which is analogous to solving substitution errors, the decoder is trained to solve simulated deletion and insertion errors.

The deletion-simulated task makes the decoder predict the length of a partial target sequence, containing one or more tokens, from a corresponding masked token. For example, given a ground-truth sequence $Y=[y_{1},y_{2},y_{3},y_{4}]$ and its masked sequence $[y_{1},\text{<MASK>},y_{4}]$, obtained by merging $y_{2}$ and $y_{3}$ into a single <MASK>, the decoder is trained to predict the symbol 2 from the masked position, which corresponds to the length of the partial sequence $[y_{2},y_{3}]$. This task makes the decoder aware that a masked token may correspond to one or more output tokens, simulating recovery from deletion errors in the decoder input. To generate such masks, tokens in $Y$ are randomly sampled and replaced with <MASK>, giving $Y_{\mathrm{mask}}$ as in Eq. (2). Consecutive masks are then merged into a single mask to form a masked sequence $Y^{\mathrm{del}}$ $(|Y^{\mathrm{del}}|\leq|Y|)$, consisting of masked tokens $Y_{\mathrm{mask}}^{\mathrm{del}}$ and observed tokens $Y_{\mathrm{obs}}^{\mathrm{del}}$ $(=Y^{\mathrm{del}}\setminus Y_{\mathrm{mask}}^{\mathrm{del}})$. The target length labels $D_{\mathrm{del}}=\{d_{i}\in\mathbb{Z}\,|\,i=1,\dots,|Y_{\mathrm{mask}}^{\mathrm{del}}|\}$ are obtained from this mask-merging process. $D_{\mathrm{del}}$ is predicted conditioned on the observed tokens $Y_{\mathrm{obs}}^{\mathrm{del}}$ as:

$P_{\mathrm{lp}}(D_{\mathrm{del}}|Y_{\mathrm{obs}}^{\mathrm{del}},X)=\prod_{i}P(d_{i}|Y_{\mathrm{obs}}^{\mathrm{del}},X),$ (5)

where $P_{\mathrm{lp}}$ denotes the posterior probability distribution over mask lengths.
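The following sketch illustrates how a deletion-simulated training sample and its length labels $D_{\mathrm{del}}$ might be constructed; the `MASK_ID` placeholder and the masking probability are illustrative assumptions.

```python
import random

MASK_ID = -1  # placeholder for <MASK>; an illustrative assumption

def make_deletion_sample(y, mask_prob=0.3):
    """Randomly mask tokens of y, then merge runs of consecutive masks into a
    single <MASK>. Each merged mask's length label is the number of ground-truth
    tokens it covers (>= 1); these labels are the targets D_del of Eq. (5)."""
    masked = [MASK_ID if random.random() < mask_prob else t for t in y]
    y_del, lengths = [], []
    for t in masked:
        if t == MASK_ID and y_del and y_del[-1] == MASK_ID:
            lengths[-1] += 1          # extend the current merged mask
        else:
            y_del.append(t)
            if t == MASK_ID:
                lengths.append(1)     # start a new merged mask of length 1
    return y_del, lengths

# Example: Y = [y1, y2, y3, y4] with y2 and y3 masked
# -> y_del = [y1, <MASK>, y4] and lengths = [2].
```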

The insertion-simulated task makes the decoder predict “zero” from a masked token, indicating that the mask corresponds to no partial target sequence. For example, given a ground-truth sequence $Y=[y_{1},y_{2},y_{3},y_{4}]$ and a masked sequence $[y_{1},y_{2},y_{3},\text{<MASK>},y_{4}]$, the decoder is trained to predict the symbol 0 from the masked position, as there are no corresponding tokens in $Y$. This makes the decoder aware of redundant masked tokens, simulating recovery from insertion errors in the decoder input. To obtain masks for this task, we randomly insert <MASK> into $Y$ to form a masked sequence $Y^{\mathrm{ins}}$ $(|Y^{\mathrm{ins}}|>|Y|)$, consisting of masked tokens $Y_{\mathrm{mask}}^{\mathrm{ins}}$ and observed tokens $Y_{\mathrm{obs}}^{\mathrm{ins}}$ $(=Y^{\mathrm{ins}}\setminus Y_{\mathrm{mask}}^{\mathrm{ins}})$. The target lengths $D_{\mathrm{ins}}=\{d_{i}=0\,|\,i=1,\dots,|Y_{\mathrm{mask}}^{\mathrm{ins}}|\}$ of the inserted masks are predicted as:

$P_{\mathrm{lp}}(D_{\mathrm{ins}}|Y_{\mathrm{obs}}^{\mathrm{ins}},X)=\prod_{i}P(d_{i}|Y_{\mathrm{obs}}^{\mathrm{ins}},X).$ (6)
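A corresponding sketch for the insertion-simulated task is shown below, reusing the same `MASK_ID` convention; the insertion probability is again an illustrative assumption.

```python
import random

MASK_ID = -1  # placeholder for <MASK>, as in the previous sketch

def make_insertion_sample(y, insert_prob=0.15):
    """Randomly insert extra <MASK> tokens into y. Every inserted mask has the
    length label 0 (the targets D_ins of Eq. (6)), teaching the decoder to drop
    redundant positions at inference time."""
    y_ins = []
    for t in y:
        if random.random() < insert_prob:
            y_ins.append(MASK_ID)     # redundant mask whose target length is 0
        y_ins.append(t)
    return y_ins

# Example: Y = [y1, y2, y3, y4] -> y_ins = [y1, y2, y3, <MASK>, y4],
# and the decoder is trained to output 0 at the inserted position.
```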

Both tasks are trained jointly with a single shared linear layer followed by a softmax classifier (with the maximum length class set to 50) on top of the decoder. The overall DLP objective is formulated by combining Eqs. (5) and (6) as:

$\mathcal{L}_{\mathrm{LP}}=\log P_{\mathrm{lp}}(D_{\mathrm{del}}|Y_{\mathrm{obs}}^{\mathrm{del}},X)+\log P_{\mathrm{lp}}(D_{\mathrm{ins}}|Y_{\mathrm{obs}}^{\mathrm{ins}},X).$ (7)

Finally, a new Mask-CTC model is trained with a loss combining the conventional objective $\mathcal{L}_{\mathrm{NAR}}$ from Eq. (3) and the objective of the proposed DLP $\mathcal{L}_{\mathrm{LP}}$ from Eq. (7) as follows:

$\mathcal{L}_{\mathrm{NAR-LP}}=\mathcal{L}_{\mathrm{NAR}}+\beta\mathcal{L}_{\mathrm{LP}},$ (8)

where $\beta$ $(>0)$ is a tunable parameter.
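As a sketch of how the length classifier and the combined objective could look, the snippet below adds a single linear layer over the decoder states at masked positions, with 51 classes for lengths 0 to 50 (matching the maximum class above); the decoder dimension and tensor layout are assumptions for illustration.

```python
import torch
import torch.nn as nn

MAX_LEN_CLASS = 50                                  # maximum length class (Sec. 3.2.1)
length_head = nn.Linear(256, MAX_LEN_CLASS + 1)     # 256 = assumed decoder dimension

def dlp_loss(decoder_states, mask_positions, length_targets):
    """Cross-entropy over length classes at the masked positions, i.e. the
    negative of one of the two L_LP terms in Eq. (7).

    decoder_states: (L, d) tensor of decoder hidden states,
    mask_positions: LongTensor of masked indices,
    length_targets: LongTensor of length labels (D_del or D_ins)."""
    logits = length_head(decoder_states[mask_positions])   # (#masks, 51)
    return nn.functional.cross_entropy(logits, length_targets)

# Overall training objective of Eq. (8): loss = l_nar + beta * (l_del + l_ins),
# where l_del and l_ins are dlp_loss terms for the two simulated tasks.
```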

3.2.2 Inference

Algorithm 1 Shrink-and-Expand Decoding
1: $K$: number of iteration steps, $X$: input speech, $\hat{Y}=\{\hat{y}_{l}\}_{l=1}^{L}$: CTC output
2: Calculate $P_{l,\mathrm{mlm}}=P_{\mathrm{mlm}}(y_{l}=\hat{y}_{l}|\hat{Y},X)$ for each $\hat{y}_{l}\in\hat{Y}$
3: Obtain $\hat{Y}_{\mathrm{mask}}=\{\hat{y}_{l}\,|\,P_{l,\mathrm{mlm}}<P_{\mathrm{thres}}\}_{l=1}^{L}$
4: Mask $\hat{Y}$: set $\hat{y}_{l}\leftarrow\varnothing$ (<MASK>) for $\hat{y}_{l}\in\hat{Y}_{\mathrm{mask}}$; keep $\hat{y}_{l}$ for $\hat{y}_{l}\in\hat{Y}_{\mathrm{obs}}=\hat{Y}\setminus\hat{Y}_{\mathrm{mask}}$
5: $C=\lfloor|\hat{Y}_{\mathrm{mask}}|/K\rfloor$   // #masks predicted in each loop
6: while stopping criterion not met do
7:     // Shrink (Ex. $[\hat{y}_{1},\varnothing,\varnothing,\varnothing,\hat{y}_{5},\varnothing,\hat{y}_{7}]\rightarrow[\hat{y}_{1},\varnothing,\hat{y}_{5},\varnothing,\hat{y}_{7}]$)
8:     Shrink consecutive masks in $\hat{Y}$ and update $\hat{Y}_{\mathrm{mask}}$, $\hat{Y}_{\mathrm{obs}}$ accordingly
9:     Calculate $P_{l,\mathrm{lp}}(d_{l}|\hat{Y}_{\mathrm{obs}},X)$ for each $\hat{y}_{l}\in\hat{Y}_{\mathrm{mask}}$
10:    // Expand (Ex. $[\hat{y}_{1},\varnothing,\hat{y}_{5},\varnothing,\hat{y}_{7}]\rightarrow[\hat{y}_{1},\varnothing,\varnothing,\hat{y}_{5},\hat{y}_{7}]$)
11:    Expand masks in $\hat{Y}$ based on $\operatorname*{argmax}_{d}P_{l,\mathrm{lp}}(d_{l}=d|\hat{Y}_{\mathrm{obs}},X)$ and update $\hat{Y}_{\mathrm{mask}}$, $\hat{Y}_{\mathrm{obs}}$ accordingly
12:    Calculate $P_{l,\mathrm{mlm}}(y_{l}|\hat{Y}_{\mathrm{obs}},X)$ for each $\hat{y}_{l}\in\hat{Y}_{\mathrm{mask}}$
13:    Predict masks in $\hat{Y}$ as $\operatorname*{argmax}_{y}P_{l,\mathrm{mlm}}(y_{l}=y|\hat{Y}_{\mathrm{obs}},X)$, where the $\hat{y}_{l}$ with the top-$C$ highest probabilities are selected
14:    Update $\hat{Y}_{\mathrm{mask}}$, $\hat{Y}_{\mathrm{obs}}$
15: end while
16: return $\hat{Y}$

Alg. 1 shows the proposed shrink-and-expand decoding algorithm, which allows tokens to be deleted and inserted during sequence generation. Compared with the conventional method [29] (explained in Sec. 2.2), the proposed decoding differs in how the CTC output is masked and in how the masked tokens are predicted.

While the previous method detects low-confidence tokens in the CTC output by thresholding the CTC posterior probabilities, the proposed decoding refers to the probabilities of the MLM decoder (line 2 in Alg. 1). We found that the decoder probabilities, which take advantage of bi-directional token contexts, are more suitable for detecting CTC errors.

Shrink-and-expand decoding dynamically changes the target sequence length by deleting or inserting masks at each iteration. The shrink step (line 7 in Alg. 1) merges consecutive masks into a single mask. The shrunk sequence is then fed to the decoder to predict the length of each mask (line 9 in Alg. 1), i.e., the number of tokens required to match the length of the expected target sequence. The expand step (line 10 in Alg. 1) adjusts the number of masks according to the predicted length: for example, if the length of a mask is predicted as 2, we insert an additional mask to form two consecutive masks, and if it is predicted as 0, we delete the mask from the target sequence. Finally, the masked tokens are predicted as before, following Eq. (4). Note that shrink-and-expand decoding requires two forward passes of the decoder per inference step: one to predict the length of each mask and one to predict the target tokens.
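The two mask operations can be sketched on a plain token list as follows; `MASK_ID` is the same illustrative placeholder used earlier, and `predicted_lengths` stands for the decoder's length predictions, one per shrunk mask.

```python
MASK_ID = -1  # illustrative placeholder for <MASK>

def shrink(y):
    """Shrink step: merge runs of consecutive masks into a single mask."""
    out = []
    for t in y:
        if t == MASK_ID and out and out[-1] == MASK_ID:
            continue                   # drop the extra mask in a run
        out.append(t)
    return out

def expand(y, predicted_lengths):
    """Expand step: replace each mask with the predicted number of masks;
    a predicted length of 0 deletes the mask, 2 turns it into two masks, etc."""
    out, lengths = [], iter(predicted_lengths)
    for t in y:
        if t == MASK_ID:
            out.extend([MASK_ID] * next(lengths))
        else:
            out.append(t)
    return out

# Example (matching Alg. 1): shrink([y1, M, M, M, y5, M, y7]) -> [y1, M, y5, M, y7];
# with predicted lengths [2, 0], expand gives [y1, M, M, y5, y7].
```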

4 Experiments

Table 1: Word error rate (WER) and real time factor (RTF) on WSJ. The proposed improvements to Mask-CTC (the use of Conformer and dynamic length prediction) are compared with CTC and autoregressive (AR) models. $K$ denotes the number of inference steps used to generate the output tokens. RTF was measured on dev93 using a CPU. For each Mask-CTC model, RTF was calculated for $K=10$.

| Model | #Params (M) | dev93 WER (K = 0 / 1 / 5 / 10) | eval92 WER (K = 0 / 1 / 5 / 10) | RTF | Speedup |
| Autoregressive (K = L; avg. L = 99.9 on dev93, 100.3 on eval92) | | | | | |
| A1 Transformer-AR [12] | 27.2 | 13.5 | 10.8 | 0.456±0.005 | 1.00× |
| A2  + beam search | 27.2 | 12.8 | 10.6 | 5.067±0.012 | 0.09× |
| A3 Conformer-AR | 30.4 | 11.4 | 8.8 | 0.474±0.009 | 0.96× |
| A4  + beam search | 30.4 | 11.1 | 8.5 | 5.094±0.031 | 0.09× |
| Non-autoregressive (K ≤ C, C: const.) | | | | | |
| B1 Transformer-CTC | 17.7 | 19.4 / - / - / - | 15.5 / - / - / - | 0.021±0.000 | 21.71× |
| B2 Transformer-Mask-CTC [29] | 27.2 | 15.5 / 15.2 / 14.9 / 14.9 | 12.5 / 12.2 / 12.0 / 12.0 | 0.063±0.001 | 7.24× |
| B3  + dynamic length prediction | 27.2 | 15.5 / 14.0 / 13.9 / 13.8 | 12.4 / 11.1 / 10.8 / 10.8 | 0.074±0.001 | 6.16× |
| C1 Conformer-CTC | 20.9 | 13.0 / - / - / - | 10.8 / - / - / - | 0.033±0.000 | 13.81× |
| C2 Conformer-Mask-CTC | 30.4 | 11.9 / 11.8 / 11.7 / 11.7 | 9.4 / 9.2 / 9.2 / 9.1 | 0.063±0.000 | 7.24× |
| C3  + dynamic length prediction | 30.4 | 11.8 / 11.3 / 11.3 / 11.3 | 9.6 / 9.3 / 9.1 / 9.1 | 0.080±0.000 | 5.70× |
Table 2: Word error rates (WER) on Voxforge Italian and TEDLIUM2. Results with beam search are reported in parentheses.

| Model | #itr K | Voxforge | TEDLIUM2 |
| TF-AR [12] | L | 35.5 (35.7) | 9.5 (8.9) |
| CF-AR | L | 29.8 (29.8) | 8.4 (7.9) |
| TF-CTC | 0 | 56.1 | 16.6 |
| TF-Mask-CTC [29] | 10 | 38.3 | 10.9 |
|  + DLP | 5 | 35.1 | 10.6 |
| CF-CTC | 0 | 31.8 | 9.5 |
| CF-Mask-CTC | 10 | 29.2 | 8.6 |
|  + DLP | 5 | 29.0 | 9.7 |

To evaluate the effectiveness of the proposed improvements of Mask-CTC, we conducted experiments on E2E-ASR models using ESPnet [37]. The recognition performance and the inference speed were evaluated based on word error rate (WER) and real time factor (RTF), respectively. All of the decodings were done without using external language models (LMs).

4.1 Datasets

The experiments were carried out on three tasks: the 81-hour Wall Street Journal (WSJ) corpus [38], the 210-hour TEDLIUM2 [39], and the 16-hour Italian Voxforge [40]. As network inputs, we used 80 mel-scale filterbank coefficients with three-dimensional pitch features extracted using Kaldi [41]. To avoid overfitting, we applied data augmentation with speed perturbation [42] and SpecAugment [43], chosen depending on the task and model. For tokenizing the target texts, we used characters for WSJ and Voxforge; for TEDLIUM2, we generated a 500-subword vocabulary with the byte pair encoding (BPE) algorithm [44].

4.2 Experimental setup

For all tasks, we adopted the same Transformer [16] architecture as in [12], where the number of heads $H$, the dimension of the self-attention layers $d^{\mathrm{att}}$, and the dimension of the feed-forward layers $d^{\mathrm{ff}}$ were set to 4, 256, and 2048, respectively. The encoder consisted of 2 CNN-based downsampling layers followed by 12 self-attention layers, and the decoder consisted of 6 self-attention layers. For the Conformer encoder [15], we used the same self-attention configuration as in Transformer, except that $d^{\mathrm{ff}}$ was set to 1024 (we reduced $d^{\mathrm{ff}}$ to keep the Conformer-based model from becoming slow at inference due to its larger number of parameters). For model training, we tuned hyper-parameters (e.g., the number of training epochs) following the recipes provided by ESPnet, and we will make our configurations publicly available on ESPnet to ensure reproducibility. The loss weight $\alpha$ in Eq. (3) was set to 0.3, and $\beta$ in Eq. (8) was set to 0.1 for Voxforge and 1.0 for WSJ and TEDLIUM2. After training, a final model was obtained by averaging model parameters over the 10 to 50 checkpoints with the best validation performance. During inference, the threshold $P_{\mathrm{thres}}$ for the conventional Mask-CTC (explained in Sec. 2.2) was tuned over {0.9, 0.99, 0.999}. For the proposed decoding in Sec. 3.2, we fixed $P_{\mathrm{thres}}$ to 0.5 for all tasks. RTF was measured on the dev93 utterances of WSJ using an Intel(R) Xeon(R) Gold 6148 CPU at 2.40 GHz.

4.3 Evaluated models

We evaluated different E2E-ASR models, each of which can be either autoregressive (AR) or non-autoregressive (NAR). AR indicates an AR model trained with the joint CTC-attention objective [34, 12]; for inference, we used joint CTC-attention decoding [45] with beam search. CTC indicates a NAR model trained only with the CTC objective in Eq. (1) [26]. Mask-CTC indicates a NAR model trained with the conventional Mask-CTC framework [29] explained in Sec. 2.2. We also applied the proposed dynamic length prediction (DLP, explained in Sec. 3.2) to Mask-CTC.

For each model, we compared the results between Transformer (TF) and Conformer (CF) encoders.

4.4 Results

Table 1 shows the results on WSJ. Comparing the NAR models using Transformer (B*), the proposed DLP (B3) outperformed the conventional Mask-CTC (B2). Since B2 and B3 start from nearly identical CTC results ($K=0$) and B3 yields larger improvements, we conclude that DLP successfully recovered insertion and deletion errors in the CTC output. The performance of CTC improved significantly with Conformer (B1, C1), and Mask-CTC benefited greatly from it (C2). The errors were further reduced by applying DLP (C3), achieving 9.1% WER on eval92, the best among the NAR models and better than the state-of-the-art results without an LM [46, 47]. Comparing the NAR and AR models, Mask-CTC achieved performance highly competitive with the AR models for both Transformer (A1, B3) and Conformer (A3, C3), demonstrating the effectiveness of the proposed methods for improving the original Mask-CTC.

In terms of decoding speed on a CPU, all NAR models (B*, C*) were at least 5.7 times faster than the AR models (A*), achieving an RTF under 0.1. Applying DLP (B3, C3) slowed Mask-CTC down slightly because of the additional decoder calculations (explained in Sec. 3.2.2). However, with DLP the error rates converged in fewer iterations ($K=5$), so the overall inference speed was the same as, or even faster than, that of the original Mask-CTC.

Table 2 shows the results on Voxforge and TEDLIUM2. With the proposed improvements, Mask-CTC achieved results comparable to those of the AR models. However, DLP did not improve CF-Mask-CTC on TEDLIUM2; we observed that the CF-Mask-CTC output was already accurate enough that DLP had little room to help.

4.5 End-to-end speech translation (E2E-ST)

To explore a potential application to other speech tasks, we applied the Mask-CTC framework to the E2E-ST task, following [48]. For the NAR models, we used sequence-level knowledge distillation [49]. Table 3 shows the results on the Fisher-CallHome Spanish corpus [50]. Since input-output alignments are non-monotonic in this task, we observed that confidence filtering based on the CTC probabilities did not work well, unlike in ASR. Next, we performed the mask-predict (MP) decoding proposed for MT [22], starting from all <MASK>, and confirmed some gains over CTC. Finally, we initialized the target sequence with the filtered CTC output as in the ASR task and then performed MP decoding. Here, the number of masked tokens at each iteration is truncated by the number of initially masked tokens $N_{\mathrm{mask}}$ (restricted MP) to keep information from the CTC output for the later iterations. This improved the results over the CTC greedy results by a large margin. Moreover, interestingly, Mask-CTC outperformed the AR model on the Fisher sets of this corpus.
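As a rough illustration of the restricted MP decoding described above, the sketch below caps the number of re-masked tokens per iteration at the initial mask count $N_{\mathrm{mask}}$; `mlm_decoder` is an assumed callable returning a best token and a confidence for every position, and the linear decay schedule follows the original mask-predict [22]. This is a simplified sketch under those assumptions, not the exact implementation used for Table 3.

```python
def restricted_mask_predict(y_init, mlm_decoder, K=10, mask_id=-1):
    """Mask-predict decoding whose per-iteration re-masking count is truncated
    by the number of initially masked tokens, so that most of the information
    kept from the CTC output survives the later iterations."""
    y = list(y_init)                              # CTC output with low-confidence
    L = len(y)                                    # tokens already set to mask_id
    n_mask = sum(1 for t in y if t == mask_id)    # initial N_mask
    for k in range(1, K + 1):
        pred, conf = mlm_decoder(y)               # assumed: lists of length L
        y = [pred[l] if y[l] == mask_id else y[l] for l in range(L)]
        if k == K:
            break
        # linear decay of mask-predict, truncated by the initial mask count
        n = min(int(L * (K - k) / K), n_mask)
        for l in sorted(range(L), key=lambda i: conf[i])[:n]:
            y[l] = mask_id                        # re-mask the least confident tokens
    return y
```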

Table 3: Speech translation results (BLEU) on Fisher-CallHome Spanish.

| Model | Fisher dev | Fisher dev2 | Fisher test | CallHome devtest | CallHome evltest |
| TF-AR | 47.01 | 47.89 | 47.19 | 18.11 | 17.95 |
| TF-CTC | 45.57 | 46.97 | 45.97 | 15.99 | 15.91 |
| TF-Mask-CTC: | | | | | |
|  + CTC greedy | 45.93 | 46.82 | 46.17 | 15.73 | 15.60 |
|  + original decoding | 44.80 | 45.40 | 44.39 | 14.14 | 14.14 |
|  + mask-predict (MP) | 47.43 | 48.14 | 46.96 | 16.52 | 16.42 |
|  + restricted MP | 49.94 | 49.42 | 48.66 | 16.96 | 16.79 |

5 Conclusion

This paper proposed improvements to Mask-CTC based non-autoregressive E2E-ASR. We adopted the Conformer encoder to boost the performance of CTC and introduced new training tasks that allow the model to dynamically delete or insert tokens during inference. The experimental results demonstrated the effectiveness of the improved Mask-CTC, which achieves performance competitive with autoregressive models at no cost in inference speed. We also showed that the Mask-CTC framework can be applied to end-to-end speech translation. Our future plan is to integrate an external language model into Mask-CTC while keeping decoding fast.

References

  • [1] Alex Graves et al., “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proc. of ICML, 2006, pp. 369–376.
  • [2] Alex Graves, “Sequence transduction with recurrent neural networks,” arXiv preprint arXiv:1211.3711, 2012.
  • [3] Ilya Sutskever et al., “Sequence to sequence learning with neural networks,” in Proc. of NeurIPS, 2014, pp. 3104–3112.
  • [4] Dzmitry Bahdanau et al., “Neural machine translation by jointly learning to align and translate,” in Proc. of ICLR, 2015.
  • [5] Geoffrey Hinton et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal processing magazine, vol. 29, no. 6, pp. 82–97, 2012.
  • [6] Chung-Cheng Chiu et al., “State-of-the-art speech recognition with sequence-to-sequence models,” in Proc. of ICASSP, 2018, pp. 4774–4778.
  • [7] Christoph Lüscher et al., “RWTH ASR systems for librispeech: Hybrid vs attention,” in Proc. of Interspeech, 2019, pp. 231–235.
  • [8] Shigeki Karita et al., “A comparative study on Transformer vs RNN in speech applications,” in Proc. of ASRU, 2019, pp. 449–456.
  • [9] Tara N Sainath et al., “A streaming on-device end-to-end model surpassing server-side conventional model quality and latency,” in Proc. of ICASSP, 2020, pp. 6059–6063.
  • [10] Linhao Dong et al., “Speech-Transformer: a no-recurrence sequence-to-sequence model for speech recognition,” in Proc. of ICASSP, 2018, pp. 5884–5888.
  • [11] Tara N Sainath et al., “Two-pass end-to-end speech recognition,” in Proc. of Interspeech, 2019, pp. 2773–2777.
  • [12] Shigeki Karita et al., “Improving Transformer-based end-to-end speech recognition with connectionist temporal classification and language model integration,” in Proc. of Interspeech, 2019, pp. 1408–1412.
  • [13] Samuel Kriman et al., “Quartznet: Deep automatic speech recognition with 1d time-channel separable convolutions,” in Proc. of ICASSP, 2020, pp. 6124–6128.
  • [14] Wei Han et al., “ContextNet: Improving convolutional neural networks for automatic speech recognition with global context,” in Proc. of Interspeech, 2020.
  • [15] Anmol Gulati et al., “Conformer: Convolution-augmented Transformer for speech recognition,” in Proc. of Interspeech, 2020.
  • [16] Ashish Vaswani et al., “Attention is all you need,” in Proc. of NeurIPS, 2017, pp. 5998–6008.
  • [17] Jiatao Gu et al., “Non-autoregressive neural machine translation,” in Proc. of ICLR, 2018.
  • [18] Jindřich Libovický and Jindřich Helcl, “End-to-end non-autoregressive neural machine translation with connectionist temporal classification,” in Proc. of EMNLP, 2018, pp. 3016–3021.
  • [19] Jason Lee et al., “Deterministic non-autoregressive neural sequence modeling by iterative refinement,” in Proc. of EMNLP, 2018, pp. 1173–1182.
  • [20] Mitchell Stern et al., “Insertion Transformer: Flexible sequence generation via insertion operations,” in Proc. of ICML, 2019, pp. 5976–5985.
  • [21] Jiatao Gu et al., “Levenshtein Transformer,” in Proc. of NeurIPS, 2019, pp. 11181–11191.
  • [22] Marjan Ghazvininejad et al., “Mask-predict: Parallel decoding of conditional masked language models,” in Proc. of EMNLP-IJCNLP, 2019, pp. 6114–6123.
  • [23] Marjan Ghazvininejad et al., “Semi-autoregressive training improves mask-predict decoding,” arXiv preprint arXiv:2001.08785, 2020.
  • [24] Chitwan Saharia et al., “Non-autoregressive machine translation with latent alignments,” arXiv preprint arXiv:2004.07437, 2020.
  • [25] Xuezhe Ma et al., “FlowSeq: Non-autoregressive conditional sequence generation with generative flow,” in Proc. of EMNLP-IJCNLP, 2019, pp. 4273–4283.
  • [26] Alex Graves and Navdeep Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in Proc. of ICML, 2014, pp. 1764–1772.
  • [27] Nanxin Chen et al., “Listen and fill in the missing letters: Non-autoregressive Transformer for speech recognition,” arXiv preprint arXiv:1911.04908, 2019.
  • [28] William Chan et al., “Imputer: Sequence modelling via imputation and dynamic programming,” in Proc. of ICML, 2020.
  • [29] Yosuke Higuchi et al., “Mask CTC: Non-autoregressive end-to-end ASR with CTC and mask predict,” in Proc. of Interspeech, 2020.
  • [30] Yuya Fujita et al., “Insertion-based modeling for end-to-end automatic speech recognition,” in Proc. of Interspeech, 2020.
  • [31] Zhengkun Tian et al., “Spike-triggered non-autoregressive Transformer for end-to-end speech recognition,” in Proc. of Interspeech, 2020.
  • [32] Eric Battenberg et al., “Exploring neural transducers for end-to-end speech recognition,” in Proc. of ASRU, 2017, pp. 206–213.
  • [33] Jacob Devlin et al., “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. of NAACL-HLT, 2019, pp. 4171–4186.
  • [34] Suyoun Kim et al., “Joint CTC-attention based end-to-end speech recognition using multi-task learning,” in Proc. of ICASSP, 2017, pp. 4835–4839.
  • [35] Yiping Lu et al., “Understanding and improving transformer from a multi-particle dynamic system point of view,” in Proc. of ICLR, 2020.
  • [36] Yi Ren et al., “FastSpeech: Fast, robust and controllable text to speech,” in Proc. of NeurIPS, 2019.
  • [37] Shinji Watanabe et al., “ESPnet: End-to-end speech processing toolkit,” in Proc. of Interspeech, 2018, pp. 2207–2211.
  • [38] Douglas B Paul and Janet M Baker, “The design for the wall street journal-based CSR corpus,” in Proc. of Workshop on Speech and Natural Language, 1992, pp. 357–362.
  • [39] Anthony Rousseau et al., “Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks,” in Proc. of LREC, May 2014, pp. 3935–3939.
  • [40] “Voxforge,” http://www.voxforge.org.
  • [41] Daniel Povey et al., “The Kaldi speech recognition toolkit,” in Proc. of ASRU, 2011.
  • [42] Tom Ko et al., “Audio augmentation for speech recognition,” in Proc. of Interspeech, 2015.
  • [43] Daniel S Park et al., “SpecAugment: A simple data augmentation method for automatic speech recognition,” in Proc. of Interspeech, 2019, pp. 2613–2617.
  • [44] Rico Sennrich et al., “Neural machine translation of rare words with subword units,” in Proc. of ACL, 2016, pp. 1715–1725.
  • [45] Takaaki Hori et al., “Joint CTC/attention decoding for end-to-end speech recognition,” in Proc. of ACL, 2017, pp. 518–529.
  • [46] Sara Sabour et al., “Optimal completion distillation for sequence learning,” in Proc. of ICLR, 2019.
  • [47] Lasse Borgholt et al., “Do end-to-end speech recognition models care about context?,” in Proc. of Interspeech, 2020.
  • [48] Hirofumi Inaguma et al., “ESPnet-ST: All-in-one speech translation toolkit,” in Proc. ACL: System Demonstrations, 2020, pp. 302–311.
  • [49] Yoon Kim et al., “Sequence-level knowledge distillation,” in Proc. of EMNLP, 2016, pp. 1317–1327.
  • [50] Matt Post et al., “Improved speech-to-text translation with the Fisher and Callhome Spanish–English speech translation corpus,” in Proc. of IWSLT, 2013.