ADVANCING CTC-CRF BASED END-TO-END SPEECH RECOGNITION WITH WORDPIECES AND CONFORMERS
Abstract
Automatic speech recognition (ASR) systems have improved greatly in the past few decades, and current systems are mainly hybrid-based and end-to-end-based. The recently proposed CTC-CRF framework inherits the data-efficiency of the hybrid approach and the simplicity of the end-to-end approach. In this paper, we further advance the CTC-CRF based ASR technique with explorations on modeling units and neural architectures. Specifically, we investigate techniques that enable the recently developed wordpiece modeling units and Conformer neural networks to be successfully applied in CTC-CRFs. Experiments are conducted on two English datasets (Switchboard, Librispeech) and a German dataset from CommonVoice. Experimental results suggest that (i) Conformer can improve recognition performance significantly; (ii) wordpiece-based systems perform slightly worse than phone-based systems when the target language has a low degree of grapheme-phoneme correspondence (e.g. English), while the two systems perform equally strong when that degree of correspondence is high (e.g. German).
Index Terms— CTC-CRF, Conformer, Wordpiece
1 Introduction
A typical Automatic Speech Recognition (ASR) system usually consists of an acoustic model (AM), a language model (LM) and a pronunciation model (PM). Depending on how these components are organized during training and inference, current ASR systems can be divided into two main categories, namely hybrid and end-to-end (E2E) systems. Hybrid systems usually optimize each component separately and combine the AM and LM with a PM at the inference stage. Such modular optimization of hybrid systems can make full use of data (e.g. unpaired text for LM training), whereas training (involving forced alignment, tree-based clustering, etc.) is very complicated. In contrast, E2E systems aim to simplify the training pipeline and fold all the components into a single neural network (NN) that is optimized jointly. There are various types of E2E systems, including E2E-LF-MMI-based [1], CTC-based [2, 3], attention-based Seq2Seq [4, 5] and RNN-T [6, 7]. Generally, E2E systems can be trained in an elegant way with very promising results, but they tend to be data-hungry and require large amounts of paired speech data. Compared with hybrid and E2E systems, the recently developed CTC-CRF [8, 9] framework inherits the data-efficiency of the hybrid approach and the simplicity of the end-to-end approach. It has been shown that CTC-CRF outperforms regular CTC consistently on a wide range of benchmarks, and is on par with other state-of-the-art end-to-end models [8, 9, 10]. This work aims to further advance the CTC-CRF based ASR technique with explorations on modeling units and neural architectures.
Recently, wordpiece-based modeling units [7, 11] and Conformer neural networks [12] have demonstrated their effectiveness in other ASR architectures (hybrid, RNN-T). It has been shown in [8] that mono-char CTC-CRFs work well, but yield worse results than monophone CTC-CRFs. Previous work on CTC-CRFs only uses BLSTM [8] and VGG-BLSTM [9] neural architectures, which are somewhat old-fashioned. It is therefore natural to ask whether wordpiece modeling units and Conformer neural networks can bring performance gains when coupled with the CTC-CRF loss. This paper tries to answer these questions through technique investigation and extensive experiments.
After an overview of related work on wordpiece modeling units and Conformer neural architectures in Section 2.1 and Section 2.2 respectively, we introduce the CTC-CRF framework (Section 3) and present the techniques pertaining to modeling units, neural architectures, data augmentation and the learning rate scheduler used to train Conformers in Section 4, which enable the successful application of wordpieces and Conformers in CTC-CRFs. Then, we evaluate the techniques on three datasets (Switchboard, Librispeech and CommonVoice German) in Section 5. The ablation study in Section 5.4 aims to clarify how different techniques affect the recognition performance compared to that of [8, 9]. Our main contribution is that we successfully advance CTC-CRF based ASR techniques with wordpieces and Conformers, and the experimental results are very competitive among the state-of-the-art results on the three datasets.
2 Related Work
2.1 Modeling unit
A basic problem for acoustic modeling is the choice of modeling units. Conventionally, ASR systems (e.g. GMM-HMM [13], DNN-HMM [14]) rely heavily upon phonetic knowledge. These systems usually adopt phones as modeling units, where AM labels are derived from word-level transcripts using an expert-curated pronunciation lexicon. For languages like English with a low degree of grapheme-phoneme correspondence, it has been shown that systems built with a highly optimized pronunciation lexicon can achieve very competitive results [15, 16, 17]. Despite the promising results of phone-based ASR systems, creating an expert-curated pronunciation lexicon is time-consuming. Thus, another direction for AM modeling units is to replace phones with graphemes to eliminate the need for lexicon construction. Early explorations on grapheme-based modeling units showed that such systems usually perform worse than phone-based systems [18], while grapheme-based systems can match phone-based systems for languages with a high degree of grapheme-phoneme correspondence [19]. For English, grapheme-based systems did not surpass phone-based systems until context was taken into account [20]. Recently, grapheme-based units like BPE [21] or wordpieces [22], which originate from the task of Neural Machine Translation (NMT), have been broadly applied to ASR systems, including hybrid [11] and E2E [7] systems.
2.2 Neural architecture
Another problem for acoustic modeling is the appropriate design of the AM architecture. Previously, GMM-HMM [13] was the dominant architecture for ASR, where the GMM models the emission probabilities while the HMM models the transition probabilities. It has been shown that replacing the GMM with deep neural networks (DNN) can boost recognition accuracy [14]. Since then, neural networks have become the dominant architecture for AMs and various architectures have been proposed. Since speech is a signal with a quasi-stationary property, one of the main trends in exploring neural networks for acoustic modeling is to capture temporal information. For example, RNNs and their variants [23] have been shown to model such information effectively.
Recently, the Transformer architecture based on self-attention [24, 25, 26] has been broadly applied to acoustic modeling due to its excellent ability to capture long-term dependencies and its high training efficiency. In addition to modeling global information, convolutional receptive fields that capture local information have also been successfully applied in ASR [27, 28]. To leverage global and local information simultaneously, [12] proposed the Conformer, which combines self-attention with convolution, as illustrated in Fig. 1. Since then, the Conformer has been successfully applied to several speech processing tasks [29].

3 CTC-CRF based ASR
In this section, we give a brief review of CTC-CRF based ASR. Basically, CTC-CRF is a conditional random field (CRF) with CTC topology. We first introduce the CTC method. Given an observation sequence $\mathbf{x} = (x_1, \cdots, x_T)$, i.e. the speech feature sequence, we denote the corresponding label sequence as $\mathbf{l} = (l_1, \cdots, l_L)$. The CTC loss function is defined as:

$\mathcal{L}_{\mathrm{CTC}}(\theta) = -\log p_\theta(\mathbf{l}|\mathbf{x})$    (1)

where $\theta$ denotes the model parameters. Since $\mathbf{x}$ and $\mathbf{l}$ usually differ in length (i.e. $T \neq L$) and are not aligned in speech recognition, a framewise state sequence $\boldsymbol{\pi} = (\pi_1, \cdots, \pi_T)$ is introduced in CTC to handle the alignment. In CTC, the probability of a state sequence given an observation sequence is defined as:

$p_\theta(\boldsymbol{\pi}|\mathbf{x}) = \prod_{t=1}^{T} p_\theta(\pi_t|\mathbf{x})$    (2)

where $p_\theta(\pi_t|\mathbf{x})$ is the label posterior probability at time $t$ given the input sequence $\mathbf{x}$. The posterior of $\mathbf{l}$ is defined through the posterior of $\boldsymbol{\pi}$ as follows:

$p_\theta(\mathbf{l}|\mathbf{x}) = \sum_{\boldsymbol{\pi} \in \mathcal{B}^{-1}(\mathbf{l})} p_\theta(\boldsymbol{\pi}|\mathbf{x})$    (3)

where $\mathcal{B}$ is a function mapping state sequences into label sequences by removing consecutive repetitive labels and blanks [2]. Notably, Eq. (2) assumes that the states between all time steps are conditionally independent. To overcome this unreasonable assumption, CTC-CRF extends CTC and redefines the posterior of $\boldsymbol{\pi}$ as a CRF:

$p_\theta(\boldsymbol{\pi}|\mathbf{x}) = \dfrac{\exp(\phi_\theta(\boldsymbol{\pi}, \mathbf{x}))}{\sum_{\boldsymbol{\pi}'} \exp(\phi_\theta(\boldsymbol{\pi}', \mathbf{x}))}$    (4)

Here $\phi_\theta(\boldsymbol{\pi}, \mathbf{x})$ denotes the potential function of the CRF, defined as:

$\phi_\theta(\boldsymbol{\pi}, \mathbf{x}) = \log p(\mathbf{l}) + \sum_{t=1}^{T} \log p_\theta(\pi_t|\mathbf{x})$    (5)

where $\mathbf{l} = \mathcal{B}(\boldsymbol{\pi})$. The sum $\sum_{t=1}^{T} \log p_\theta(\pi_t|\mathbf{x})$ defines the node potential, calculated from the bottom DNN. $\log p(\mathbf{l})$ defines the edge potential, realized by an n-gram LM of labels, which is often referred to as the denominator n-gram LM. Combining Eq. (3)(4)(5) yields the CTC-CRF loss function:

$\mathcal{L}_{\mathrm{CTC\text{-}CRF}}(\theta) = -\log \dfrac{\sum_{\boldsymbol{\pi} \in \mathcal{B}^{-1}(\mathbf{l})} \exp(\phi_\theta(\boldsymbol{\pi}, \mathbf{x}))}{\sum_{\boldsymbol{\pi}'} \exp(\phi_\theta(\boldsymbol{\pi}', \mathbf{x}))}$    (6)

By incorporating $\log p(\mathbf{l})$ into the potential function, CTC-CRF naturally avoids the conditional independence drawback suffered by CTC. It has been shown that CTC-CRF outperforms regular CTC consistently on a wide range of benchmarks, and is on par with other state-of-the-art end-to-end models [8, 9, 10, 16].
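To make Eq. (5) and Eq. (6) concrete, the following is a minimal brute-force sketch written for illustration only; it is not the actual CAT implementation, which computes the numerator and denominator efficiently with forward-backward recursions over WFSTs. It enumerates every state sequence for a toy input, scores each with the node potential from the per-frame log posteriors plus an edge potential supplied by a user-provided label LM, and forms the loss as the negative log ratio of the two normalizing sums.

```python
import itertools
import numpy as np

def collapse(path, blank=0):
    """The CTC mapping B: remove consecutive repeats, then blanks."""
    out, prev = [], None
    for s in path:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return tuple(out)

def logsumexp(xs):
    m = max(xs)
    return m + np.log(sum(np.exp(x - m) for x in xs))

def ctc_crf_loss(log_probs, label_seq, log_p_label, blank=0):
    """Brute-force CTC-CRF loss (Eq. 6) for tiny T and label sets.

    log_probs:   (T, V) per-frame log posteriors from the bottom DNN.
    label_seq:   target label sequence l.
    log_p_label: callable returning log p(l) for a label sequence
                 (the edge potential, e.g. a denominator n-gram LM).
    """
    T, V = log_probs.shape
    num, den = [], []
    for path in itertools.product(range(V), repeat=T):
        node = sum(log_probs[t, s] for t, s in enumerate(path))
        phi = log_p_label(collapse(path, blank)) + node        # Eq. (5)
        den.append(phi)
        if collapse(path, blank) == tuple(label_seq):
            num.append(phi)                                    # pi in B^-1(l)
    return -(logsumexp(num) - logsumexp(den))                  # Eq. (6)

# Toy usage: 3 frames, labels {blank, a, b}, and a flat (unnormalized) label "LM".
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 3))
log_post = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
print(ctc_crf_loss(log_post, label_seq=(1,), log_p_label=lambda l: 0.0))
```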
4 Training recipe
In this section, we discuss the techniques applied in Conformer AM training, where the whole training pipeline is illustrated in Fig. 2. It can be seen from Fig. 2 that such pipeline consists of processing steps from two main perspectives:

Data preparation
process is very similar to that of [9], including the preparation of input features and subword tokenization for the AM. However, this work differs from [9] in how we prepare these two parts of data. For input features, we extract 80-dimensional Fbank features from a 25ms window with a stride of 10ms and normalize the features with cepstral mean and variance normalization (CMVN). Unless otherwise stated, 3-way speed perturbation [30] and SpecAug [31] are adopted as data augmentation. We will present how the new input features influence recognition performance in Section 5.4. Details of subword tokenization will be presented in Section 4.1. We also implement our own variant of SpecAug, which will be introduced in Section 4.2 and analyzed in Section 5.4.
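As a concrete illustration of the feature pipeline above, the snippet below is a minimal sketch using torchaudio's Kaldi-compatible front end. It applies per-utterance CMVN for simplicity (the actual recipe follows Kaldi-style CMVN), and speed perturbation and SpecAug are applied separately.

```python
import torch
import torchaudio

def extract_fbank(wav_path: str) -> torch.Tensor:
    """80-dim Fbank from 25 ms windows with a 10 ms shift, then CMVN."""
    wave, sample_rate = torchaudio.load(wav_path)
    feats = torchaudio.compliance.kaldi.fbank(
        wave,
        num_mel_bins=80,
        frame_length=25.0,           # milliseconds
        frame_shift=10.0,            # milliseconds
        sample_frequency=sample_rate,
    )                                # shape: (T, 80)
    # Per-utterance cepstral mean and variance normalization (a simplification).
    return (feats - feats.mean(dim=0)) / (feats.std(dim=0) + 1e-8)
```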
AM training
is conducted with Conformer neural networks coupled with the CTC-CRF loss. The speech features, with or without SpecAug applied, are first fed into a convolution subsampling layer with a subsampling factor of 1/4; a linear layer then projects the subsampled features to the hidden size of the Conformer blocks, and the projected features are passed through a stack of sequentially arranged Conformer blocks. To better monitor the training of Conformers, we introduce our learning rate scheduler (see Section 4.3). Following [12], we also examine the influence of the Conformer size on recognition performance in Section 5.4.
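The sketch below illustrates this computation flow in PyTorch under some assumptions: a two-layer stride-2 convolution provides the roughly 1/4 time subsampling, and `conformer_blocks` stands for any stack of Conformer blocks from an existing implementation mapping (B, T', d_model) to (B, T', d_model). The actual layer hyper-parameters of our models are given in Section 5.4.2.

```python
import torch
import torch.nn as nn

class ConformerAM(nn.Module):
    """Front end + encoder sketch: 1/4 conv subsampling, linear projection
    to the Conformer hidden size, a stack of Conformer blocks, and a
    per-frame log-softmax over the label set (fed to the CTC-CRF loss)."""

    def __init__(self, n_feats: int, d_model: int, n_labels: int,
                 conformer_blocks: nn.Module):
        super().__init__()
        # Two stride-2 convolutions subsample the time axis by roughly 4.
        self.subsample = nn.Sequential(
            nn.Conv2d(1, d_model, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(d_model, d_model, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.proj = nn.Linear(d_model * ((n_feats + 3) // 4), d_model)
        self.encoder = conformer_blocks      # assumed: (B, T', d_model) -> same shape
        self.out = nn.Linear(d_model, n_labels)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, n_feats), e.g. 80-dim Fbank with SpecAug already applied
        x = self.subsample(feats.unsqueeze(1))          # (B, d_model, T/4, F/4)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)  # (B, T/4, d_model*F/4)
        x = self.encoder(self.proj(x))                  # (B, T/4, d_model)
        return torch.log_softmax(self.out(x), dim=-1)   # per-frame log posteriors
```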
4.1 Subword tokenization
For subword tokenization, we implement three systems with modeling units of phones, wordpieces and characters respectively. In the phone-based systems, pronunciation lexicons are adopted to tokenize word-level transcripts when available; otherwise, Phonetisaurus Grapheme-to-Phoneme (G2P) [32] is used. For the wordpiece- and character-based systems, we use SentencePiece [33] for tokenization. The tokenization process for these two systems differs slightly from the phone-based systems: they do not use actual phonetic lexicons, but simply define a mapping rule from words to characters or wordpieces.
For wordpiece-based systems, we first train a tokenization model with the unigram mode and a wordpiece size of 150. Wordpieces occurring with high frequency in the training texts are retained, while rare ones are mapped into <unk>. We find this mapping to be crucial for AM training. Additionally, <s> and </s> are excluded since they are not involved in AM training. Consequently, 148 wordpieces are generated from the trained tokenization model. Then, we utilize the trained tokenization model to encode word-level transcripts into wordpiece IDs. For the word-to-wordpiece mapping, we tokenize the words occurring in the training set and collect the pairs of words and wordpiece IDs. We map words into wordpiece IDs rather than plain wordpiece text, because some characters in the corpus need normalization and otherwise cannot be handled properly. In short, we utilize SentencePiece [33] to map words into wordpiece IDs for AM training, and revert wordpiece IDs to words using the word-to-wordpiece-ID mapping rule at the decoding stage. The data preparation procedure of the character-based systems is the same as in the wordpiece-based systems; the only difference is that the tokenization mode is set to char when using SentencePiece.
Despite the broad application of wordpieces as modeling units for ASR systems, there seems to be no consensus on how to set the appropriate size of the wordpiece set, which is usually determined experimentally [11, 7]. When the size of the wordpiece set increases, the denominator n-gram LM (see Eq. (5)) is enlarged, which increases the training cost. Thus, we fix the size of the wordpiece set to 150 throughout all our experiments in this work.
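For reference, the following is a minimal sketch of this tokenization step with the SentencePiece Python API. The file paths are hypothetical and the exact trainer flags in our recipe may differ (e.g. how <s>, </s> and <unk> are configured), but the sketch shows the unigram training, the word-to-wordpiece-ID mapping and the transcript encoding described above.

```python
import sentencepiece as spm

# Train a unigram tokenization model with a wordpiece size of 150.
# "text/train_transcripts.txt" is a hypothetical path to the training transcripts.
spm.SentencePieceTrainer.train(
    input="text/train_transcripts.txt",
    model_prefix="wp150",
    vocab_size=150,
    model_type="unigram",        # set to "char" for the character-based systems
    bos_id=-1, eos_id=-1,        # exclude <s> and </s>; rare pieces map to <unk>
)

sp = spm.SentencePieceProcessor(model_file="wp150.model")

# Word-to-wordpiece-ID mapping collected from the words in the training set.
with open("text/train_words.txt") as f:                  # hypothetical word list
    lexicon = {w: sp.encode(w, out_type=int) for w in (line.strip() for line in f)}

# Encode a word-level transcript into wordpiece IDs for AM training,
# and revert IDs back to words (used at the decoding stage).
ids = sp.encode("but that is not the question", out_type=int)
words = sp.decode(ids).split()
```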
4.2 SpecAug with ratio
SpecAug [31] has shown its power in modern end-to-end ASR systems [12, 34]. The basic idea of SpecAug is to randomly mask parts of the input features along the time and frequency dimensions respectively, and to warp the time steps within a given range. The original SpecAug [31] does not take the length of the features into account, while recent works [12, 35] show that when SpecAug uses ratios with respect to the sequence length in masking and warping, it performs better than the original one. Thus, in this work, we implement our own ratio SpecAug and discuss it in Section 5.4.3.
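A minimal sketch of such ratio-based masking is given below; the mask counts and ratios are placeholder values rather than the settings actually used in our experiments (discussed in Section 5.4.3), and time warping is omitted for brevity.

```python
import numpy as np

def ratio_specaug(feats, time_ratio=0.05, freq_ratio=0.15,
                  n_time_masks=2, n_freq_masks=2, rng=None):
    """SpecAug-style masking where the maximum time-mask width scales with
    the utterance length. feats: (T, F) array of Fbank features."""
    rng = rng or np.random.default_rng()
    T, F = feats.shape
    out = feats.copy()
    max_t = max(1, int(T * time_ratio))    # max mask width as a ratio of length
    max_f = max(1, int(F * freq_ratio))
    for _ in range(n_time_masks):
        w = int(rng.integers(0, max_t + 1))
        t0 = int(rng.integers(0, T - w + 1))
        out[t0:t0 + w, :] = 0.0            # mask a span of frames
    for _ in range(n_freq_masks):
        w = int(rng.integers(0, max_f + 1))
        f0 = int(rng.integers(0, F - w + 1))
        out[:, f0:f0 + w] = 0.0            # mask a band of frequency bins
    return out
```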
4.3 Learning rate scheduler
The learning rate scheduler of Conformer [12], which is borrowed from Transformer training [36], includes a linear warmup followed by a decay proportional to the inverse square root of the step number, defined as:

$\mathrm{lr}(s) = d_{\mathrm{model}}^{-0.5} \cdot \min\left(s^{-0.5},\; s \cdot s_{\mathrm{warmup}}^{-1.5}\right)$    (7)

where $s$ and $s_{\mathrm{warmup}}$ indicate the number of training steps and warmup steps respectively, and $d_{\mathrm{model}}$ indicates the hidden dimension of the Conformer blocks. With the Transformer scheduler, training usually terminates when reaching a predefined number of steps. To add more flexibility in controlling the peak warmup learning rate, we introduce a factor to multiply with $\mathrm{lr}(s)$, which is set to 0.5 for Librispeech (Section 5.2) and 1.0 for Switchboard (Section 5.1) and CommonVoice German (Section 5.3). We also introduce an early stopping mechanism to avoid overfitting: the learning rate decays by a factor of 0.3 whenever the loss does not decrease on the validation set, and training terminates when the learning rate falls below a given threshold.
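The scheduler can be summarized by the short sketch below: the `factor` argument is the peak-rate multiplier introduced above, and the plateau handling mirrors the decay-by-0.3 and early-stop rule just described (the stopping threshold shown is a placeholder, not our actual value).

```python
def scheduled_lr(step: int, d_model: int, warmup: int, factor: float = 1.0) -> float:
    """Warmup + inverse-square-root decay of Eq. (7), scaled by `factor`."""
    step = max(step, 1)
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

def on_validation_end(val_loss: float, state: dict,
                      decay: float = 0.3, stop_lr: float = 1e-6) -> bool:
    """Decay the LR scale when the validation loss stops improving and
    signal termination once the learning rate falls below a threshold."""
    if val_loss >= state["best_loss"]:
        state["scale"] *= decay            # no improvement: decay by 0.3
    else:
        state["best_loss"] = val_loss
    current_lr = state["scale"] * scheduled_lr(state["step"], state["d_model"],
                                               state["warmup"], state["factor"])
    return current_lr < stop_lr            # True -> stop training

# Example state: peak-rate factor 1.0 (Switchboard / CV German) or 0.5 (Librispeech).
state = {"best_loss": float("inf"), "scale": 1.0, "step": 25000,
         "d_model": 256, "warmup": 10000, "factor": 1.0}
```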
5 Experiments
We evaluate Conformer AMs with different modeling units on three datasets: the 260-hour Switchboard [37], the 1000-hour Librispeech [38] and the 700-hour Mozilla CommonVoice 5.1 German (CV German) [39]. For Conformer based AM training, we experiment with three model configurations and discuss their effects in Section 5.4.2. To further improve performance, we also apply Kaldi's latest rescoring scripts [40] with word-level Transformer LMs to rescore the N-best lists generated from the first-pass 4-gram based decoding.
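For clarity, the N-best rescoring used throughout can be thought of as the simple score-interpolation scheme sketched below (a simplification of Kaldi's scripts; the interpolation weight is illustrative): each first-pass hypothesis keeps its acoustic score, while the 4-gram LM score is interpolated with the neural LM score before re-ranking.

```python
def rescore_nbest(nbest, nnlm_logprob, lam=0.8):
    """Re-rank an N-best list with a neural LM (a simplified sketch).

    nbest: list of (hypothesis, acoustic_logscore, ngram_lm_logscore)
           tuples from the first-pass 4-gram decoding.
    nnlm_logprob: callable returning the neural LM log-probability of a hypothesis.
    lam: interpolation weight between neural and 4-gram LM scores (illustrative).
    """
    def total(hyp, am_score, ngram_score):
        return am_score + lam * nnlm_logprob(hyp) + (1.0 - lam) * ngram_score
    return max(nbest, key=lambda entry: total(*entry))[0]
```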
5.1 Switchboard
The techniques in Section 4 are applied to the conversational Switchboard dataset. Following [40], we train Transformer LMs with 6 hidden layers, using the speech transcripts and the English Fisher corpus, which together contain 34M tokens. The vocabularies used for LM training differ slightly between the phone-based and wordpiece-based systems, but both contain roughly 30K words. This difference originates from the construction of the lexicon as described in Section 4.1. Accordingly, we train Transformer LMs separately for the two systems with the same configuration. The results on Switchboard are shown in Table 1.
Table 1: WER (%) results on Switchboard, evaluated on Eval2000 (SW and CH denote the Switchboard and CallHome subsets).

| Method | LM | Unit | #Params (M) | SW | CH | Eval2000 |
| --- | --- | --- | --- | --- | --- | --- |
| RNN-T | | | | | | |
| BLSTM [34] | RNN LM | char | 57 | 6.6 | 13.9 | 10.3 |
| + i-vectors | | | | 6.4 | 13.4 | 9.9 |
| CTC/Attention | | | | | | |
| Conformer (ESPnet)1 | Trans. | bpe | 44.6 | 6.8 | 14.0 | 10.4 |
| LF-MMI | | | | | | |
| TDNN-F [40] | 4-gram | triphone | - | 8.6 | 17.0 | 12.8 |
| + Trans. (N-best rescoring) | | | | 7.2 | 14.4 | 10.8 |
| + Trans. (iterative lattice rescoring) | | | | 6.5 | 13.9 | 10.2 |
| TDNN-LSTM [41] | 4-gram | biphone | - | 9.8 | 19.3 | [14.6] |
| + RNN LM | | | - | 8.5 | 17.4 | [13.0] |
| CTC-CRF | | | | | | |
| VGGBLSTM [9] | 4-gram | monophone | 39.15 | 9.8 | 18.8 | 14.3 |
| + RNN LM | | | | 8.8 | 17.4 | 13.1 |
| Conformer (this work) | 4-gram | monophone | 51.82 | 7.9 | 16.1 | 12.1 |
| + Trans. | | | | 6.9 | 14.5 | 10.7 |
| Conformer (this work) | 4-gram | wordpiece | 51.85 | 8.7 | 16.5 | 12.7 |
| + Trans. | | | | 7.2 | 14.8 | 11.1 |

1 Results of Conformer with 2k BPE from https://github.com/espnet/espnet/tree/master/egs/swbd/asr1.
As Table 1 shows, for monophone CTC-CRFs, using Conformer in this work significantly outperforms the prior work using VGG-BLSTM [9], with or without NN LM rescoring. Specifically, by using Conformer together with Transformer LM rescoring, this work reduces the WER by 18.32% relative (10.7% vs 13.1%) against the prior best CTC-CRF system, which uses VGG-BLSTM and RNN LM rescoring [9]. The wordpiece-based CTC-CRF performs very close to the monophone CTC-CRF (11.1% vs 10.7%), both using Conformers. Considering that English is a language with a low degree of grapheme-phoneme correspondence, this performance gap is not surprising. Nevertheless, the gap between monophone CTC-CRF and wordpiece CTC-CRF is smaller than the gap between monophone CTC-CRF and mono-char CTC-CRF reported in [8]. This confirms the advantage of using wordpiece units over characters. Notably, CTC-CRF is similar to LF-MMI [42], so we also compare the monophone CTC-CRF system with a triphone LF-MMI system as in [40], both using the same Transformer LM rescoring scripts. CTC-CRF achieves slightly better results (10.7% vs 10.8%) with N-best rescoring, even though LF-MMI is further improved by iterative lattice rescoring, which, however, doubles the computation cost compared to N-best rescoring. We find that for our CTC-CRF based systems, iterative lattice rescoring hardly improves over N-best rescoring. Our systems also exhibit competitive performance when compared to other top performing systems from the literature on the Switchboard dataset.
5.2 Librispeech
The Librispeech dataset includes 960 hours of speech data, of which we use 95% to train the AM and the rest as the validation set. The official test-clean, test-other, dev-clean and dev-other sets are used for evaluation and excluded from training. In the Librispeech experiments, we only apply our implementation of SpecAug and do not use 3-way speed perturbation.
We use the openly available 42-layer Transformer LM trained by RWTH [43]. The Transformer LM was trained on the speech transcripts plus an additional 800M-word text-only corpus. The vocabulary size is around 200k. We plug the LM scores calculated from RWTH’s LM into Kaldi’s Transformer LM rescoring scripts [40] to obtain the N-best rescoring results. We present the results on Librispeech and compare our results against those from different systems in the literature in Table 2.
Table 2: WER (%) results on the Librispeech test-clean and test-other sets.

| Method | LM | Unit | #Params (M) | test-clean | test-other |
| --- | --- | --- | --- | --- | --- |
| RNN-T | | | | | |
| Conformer [12] | | wordpiece | 118.8 | 1.9 | 3.9 |
| CTC | | | | | |
| vggTrans. [11] | 4-gram | wordpiece | 81 | 2.31 | 4.79 |
| + Trans. | | | | 2.10 | 4.20 |
| Hybrid | | | | | |
| Multistream CNN [15] | 4-gram | triphone | 20 | 2.80 | 7.06 |
| + RNN LM | | | | 2.34 | 6.04 |
| + SRU LM | | | | 1.75 | 4.46 |
| BLSTM [43] | 4-gram | triphone | - | 3.8 | 8.8 |
| + Trans. | | | - | 2.5 | 5.7 |
| CTC-CRF | | | | | |
| BLSTM [8] | 4-gram | monophone | 13 | 4.09 | 10.65 |
| Conformer (this work) | 4-gram | monophone | 51.82 | 3.61 | 8.10 |
| + Trans. | | | | 2.51 | 5.95 |
| Conformer (this work) | 4-gram | wordpiece | 51.85 | 3.59 | 8.37 |
| + Trans. | | | | 2.54 | 6.33 |
Generally, the main observations from Table 2 on Librispeech are the same as those from Table 1 on Switchboard. Within the CTC-CRF framework, Conformer neural networks bring significant performance gains, and phone-based and wordpiece-based systems perform equally strong. Compared to a strong hybrid system [43] which uses the same Transformer LM, the CTC-CRF systems achieve lower WERs before NN LM rescoring, and the WERs are very close after NN LM rescoring. When compared with a recent top performing system [15], our phone-based system performs slightly better on test-other (5.95% vs 6.04%) while applying NN LM rescoring only once. When compared with [12, 11], the current CTC-CRF models are much smaller and could be further improved.
5.3 Mozilla CommonVoice German
Mozilla CommonVoice is a crowdsourced, open-source dataset covering multiple languages. Our experiments are conducted on the CommonVoice 5.1 German dataset, which consists of around 700 hours of speech data and paired text. We first conduct the monophone-based experiments with the original data split, but find that the loss on the development set converges abnormally after only 2 epochs of training. Thus, we combine the train and development data and then re-split them for training and validation. The same data split is applied to the character-based and wordpiece-based systems.
The Transformer LMs are trained on the speech transcripts only, without extra texts, and the hyper-parameters are the same as those in Section 5.1. There are about 13M tokens in the training set, and the vocabulary size is roughly 157K for all three systems, based on monophones, characters and wordpieces respectively. Notably, the wordpiece-based and character-based systems use the same vocabulary, which differs from that of the phone-based system. We therefore train two Transformer LMs separately with the same configuration to conduct N-best rescoring.
Table 3: WER (%) results on CommonVoice German.

| Method | LM | Unit | #Params (M) | WER (%) |
| --- | --- | --- | --- | --- |
| CTC/Attention | | | | |
| Transformer (ESPnet) | RNN LM | bpe | 27.42 | 10.8 |
| CTC-CRF | | | | |
| Conformer (this work) | 4-gram | char | 25.03 | 12.7 |
| + Trans. | | | | 11.6 |
| Conformer (this work) | 4-gram | monophone | 25.03 | 10.7 |
| + Trans. | | | | 10.0 |
| Conformer (this work) | 4-gram | wordpiece | 25.06 | 10.5 |
| + Trans. | | | | 9.8 |
As shown in Table 3, the wordpiece-based CTC-CRF system performs better than the wordpiece-based CTC/Attention system from ESPnet, with a 9.3% relative WER reduction. Further, the performance gap between wordpiece-based and phone-based CTC-CRFs is very small on the CV German dataset, and the wordpiece-based CTC-CRFs even achieve slightly better results than the phone-based CTC-CRFs. This observation differs from Sections 5.1 and 5.2, where phone-based systems outperform wordpiece-based systems. An important factor may be the language difference: English has a low degree of grapheme-phoneme correspondence, while German has a high one. Compared to the wordpiece-based systems, the character-based systems perform worse. This again confirms the advantage of using wordpieces over characters as modeling units for ASR.
5.4 Ablation studies
Our ablation experiments are conducted on the Switchboard dataset, basically following the settings in Section 5.1. Different methods are evaluated under the phone-based systems and the results are from the first-pass 4-gram decoding.
5.4.1 AM architecture
We compare BLSTM and VGGBLSTM with the Conformer architecture in the CTC-CRF systems, under the same experimental settings and data preparation as in [8, 9]. We reduce the hidden size of the Conformer to 128 to make the model size comparable to that of the BLSTM in [8]. As Table 4 shows, compared to BLSTM, the Conformer with a similar model size obtains a 5.3% relative WER reduction. Notably, the Conformer even performs slightly better than the VGGBLSTM, which contains more than 3 times as many parameters. These results clearly show the advantage of Conformer over BLSTM and VGGBLSTM.
5.4.2 Conformer model size
We use three Conformer models with 12M, 25M and 51M parameters respectively, denoted Conformer-S+ (small plus), Conformer-M (medium) and Conformer-M+ (medium plus), which, for clarity, are named differently from the notations in [12]. Conformer-S+ has 16 blocks with a hidden size of 180, 4 attention heads and a convolution kernel size of 32. For Conformer-M and Conformer-M+, the corresponding hyper-parameters are (16, 256, 4, 32) and (17, 360, 8, 32) respectively. The ablation experiments demonstrate that increasing the model size clearly brings performance improvements.
Table 5: Ablation on the Conformer model size (WER %, first-pass 4-gram decoding).

| Model | #Params (M) | SW | CH | Eval2000 |
| --- | --- | --- | --- | --- |
| Conformer-S+ | 12.81 | 8.3 | 17.1 | 12.7 |
| Conformer-M | 25.03 | 8.2 | 16.4 | 12.3 |
| Conformer-M+ | 51.82 | 7.9 | 16.1 | 12.1 |
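For convenience, the three configurations above can be summarized compactly as follows; the dictionary simply restates the hyper-parameters given in the text, and the key names are ours.

```python
# (number of Conformer blocks, hidden size, attention heads, conv kernel size)
CONFORMER_CONFIGS = {
    "S+": dict(num_blocks=16, d_model=180, num_heads=4, conv_kernel=32),  # ~12M params
    "M":  dict(num_blocks=16, d_model=256, num_heads=4, conv_kernel=32),  # ~25M params
    "M+": dict(num_blocks=17, d_model=360, num_heads=8, conv_kernel=32),  # ~51M params
}
```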
5.4.3 SpecAug
The effects of SpecAug are analyzed with the Conformer-S+ model. The hyper-parameters of ratio SpecAug have the same meanings as those in [31], but are specified as proportions of the sequence length here. All our experiments in this work use the same ratio SpecAug policy. It can be seen from Table 6 that using SpecAug yields 7.1% and 9.3% relative WER reductions on SW and CH respectively, and our ratio SpecAug performs slightly better than the original fixed-size SpecAug in [31].
Table 6: Ablation on SpecAug with the Conformer-S+ model (WER %).

| Model | Setting | SW | CH | Eval2000 |
| --- | --- | --- | --- | --- |
| Conformer-S+ | w/o SpecAug | 9.1 | 18.9 | 14.0 |
| | w/ SM SpecAug1 | 8.7 | 17.1 | 13.0 |
| | w/ ratio SpecAug | 8.3 | 17.1 | 12.7 |

1 SM is one of the SpecAug hyper-parameter sets for the Switchboard dataset in [31].
5.4.4 Input features
Previous works on CTC-CRF [8, 9] use 120-dimensional input features, consisting of 40-dimensional Fbank features with deltas and delta-deltas, fed to a BLSTM [8] or VGGBLSTM [9]. In this experiment, we follow the VGGBLSTM based CTC-CRF [9] and treat the 120-dimensional input as 3-channel features, rather than the flat 120-dimensional features used with BLSTM, before feeding them into the bottom convolution layer. It can be seen from Table 7 that the delta and delta-delta features have a marginal effect on the WER results, while increasing the Fbank dimension to 80 is beneficial to the performance.
Table 7: Ablation on input features with the Conformer-S+ model (WER %).

| Model | Feature | SW | CH | Eval2000 |
| --- | --- | --- | --- | --- |
| Conformer-S+ | 40 Fbank + Δ + ΔΔ | 9.1 | 17.7 | 13.4 |
| | 40 Fbank | 8.8 | 17.7 | 13.3 |
| | 80 Fbank | 8.3 | 17.1 | 12.7 |
6 Conclusion
In this paper, we successfully advance CTC-CRF based ASR techniques with wordpiece modeling units and Conformer neural networks. Several other techniques, including input features, data augmentation and model size, are also thoroughly examined. We find that (i) Conformer can improve the performance significantly and (ii) wordpiece-based systems perform slightly worse than phone-based systems on the two English datasets (Switchboard and Librispeech), while the two systems perform equally strong on the German dataset (CommonVoice). Notably, English and German are two representative languages with low and high degrees of grapheme-phoneme correspondence respectively. Our results thus provide useful implications for unit selection in ASR.
Overall, this work demonstrates the potential of the CTC-CRF framework for ASR, which can absorb new neural architectures and achieve state-of-the-art results with or without a pronunciation lexicon. The code will be available in the open-source CAT toolkit [9] (https://github.com/thu-spmi/CAT) for reproducing the results.
7 Acknowledgements
We would like to thank Xiaohui Zhang and Ke Li for helpful discussions.
References
- [1] Hossein Hadian, Hossein Sameti, Daniel Povey, and Sanjeev Khudanpur, “End-to-end speech recognition using lattice-free MMI.,” in INTERSPEECH, 2018, pp. 12–16.
- [2] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 369–376.
- [3] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, et al., “Deep speech 2: end-to-end speech recognition in english and mandarin,” in Proceedings of the 33rd International Conference on Machine Learning, 2016, vol. 48, pp. 173–182.
- [4] Jan Chorowski, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, “End-to-end continuous speech recognition using attention-based recurrent NN: First results,” arXiv preprint arXiv:1412.1602, 2014.
- [5] Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio, “Attention-based models for speech recognition,” in Proceedings of the 28th International Conference on Neural Information Processing Systems, 2015, vol. 28, pp. 577–585.
- [6] Alex Graves, “Sequence transduction with recurrent neural networks,” arXiv preprint arXiv:1211.3711, 2012.
- [7] Kanishka Rao, Hasim Sak, and Rohit Prabhavalkar, “Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer,” in 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2017, pp. 193–199.
- [8] Hongyu Xiang and Zhijian Ou, “CRF-based single-stage acoustic modeling with CTC topology,” in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 5676–5680.
- [9] Keyu An, Hongyu Xiang, and Zhijian Ou, “CAT: A CTC-CRF based ASR toolkit bridging the hybrid and the end-to-end approaches towards data efficiency and low latency.,” in INTERSPEECH, 2020, pp. 566–570.
- [10] Huahuan Zheng, Keyu An, and Zhijian Ou, “Efficient neural architecture search for end-to-end speech recognition via straight-through gradients,” in 2021 IEEE Spoken Language Technology Workshop (SLT), 2021, pp. 60–67.
- [11] Frank Zhang, Yongqiang Wang, Xiaohui Zhang, Chunxi Liu, Yatharth Saraf, and Geoffrey Zweig, “Faster, simpler and more accurate hybrid ASR systems using wordpieces.,” in INTERSPEECH, 2020, pp. 976–980.
- [12] Anmol Gulati, James Qin, Chung-Cheng Chiu, et al., “Conformer: Convolution-augmented Transformer for speech recognition,” in INTERSPEECH, 2020, pp. 5036–5040.
- [13] SJ Young, J Jansen, JJ Odell, DG Ollason, and PC Woodland, “The HTK book,” 1995.
- [14] G. Hinton, Li Deng, Dong Yu, G. E. Dahl, A. Mohamed, N. Jaitly, Andrew Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
- [15] Kyu J. Han, Jing Pan, Venkata Krishna Naveen Tadala, Tao Ma, and Dan Povey, “Multistream CNN for robust acoustic modeling,” in 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.
- [16] Keyu An, Yi Zhang, and Zhijian Ou, “Deformable TDNN with adaptive receptive fields for speech recognition.,” arXiv preprint arXiv:2104.14791, 2021.
- [17] Wei Zhou, Simon Berger, Ralf Schluter, and Hermann Ney, “Phoneme based neural transducer for large vocabulary speech recognition,” in 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.
- [18] Stephan Kanthak and Hermann Ney, “Context-dependent acoustic modeling using graphemes for large vocabulary speech recognition,” in 2002 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2002, vol. 1, pp. 845–848.
- [19] Mirjam Killer, Sebastian Stuker, and Tanja Schultz, “Grapheme based speech recognition,” in Proceedings of the 8th European Conference on Speech Communication and Technology, 2003.
- [20] Duc Le, Xiaohui Zhang, Weiyi Zheng, Christian Fugen, Geoffrey Zweig, and Michael L. Seltzer, “From senones to chenones: Tied context-dependent graphemes for hybrid speech recognition,” in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019, pp. 457–464.
- [21] Rico Sennrich, Barry Haddow, and Alexandra Birch, “Neural machine translation of rare words with subword units,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, vol. 1, pp. 1715–1725.
- [22] Yonghui Wu, Mike Schuster, Zhifeng Chen, et al., “Google’s neural machine translation system: Bridging the gap between human and machine translation,” arXiv preprint arXiv:1609.08144, 2016.
- [23] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton, “Speech recognition with deep recurrent neural networks,” in 2013 IEEE international conference on acoustics, speech and signal processing. IEEE, 2013, pp. 6645–6649.
- [24] Shigeki Karita, Xiaofei Wang, Shinji Watanabe, Takenori Yoshimura, Wangyou Zhang, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Enrique Yalta Soplin, and Ryuichi Yamamoto, “A comparative study on Transformer vs RNN in speech applications,” in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019, pp. 449–456.
- [25] Yongqiang Wang, Abdelrahman Mohamed, Duc Le, Chunxi Liu, Alex Xiao, Jay Mahadeokar, Hongzhao Huang, Andros Tjandra, Xiaohui Zhang, Frank Zhang, Christian Fuegen, Geoffrey Zweig, and Michael L. Seltzer, “Transformer-based acoustic modeling for hybrid speech recognition,” in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 6874–6878.
- [26] Qian Zhang, Han Lu, Hasim Sak, Anshuman Tripathi, Erik McDermott, Stephen Koo, and Shankar Kumar, “Transformer transducer: A streamable speech recognition model with Transformer encoders and RNN-T loss,” in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 7829–7833.
- [27] Ossama Abdel-Hamid, Abdel-Rahman Mohamed, Hui Jiang, Li Deng, Gerald Penn, and Dong Yu, “Convolutional neural networks for speech recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 22, no. 10, pp. 1533–1545, 2014.
- [28] Jason Li, Vitaly Lavrukhin, Boris Ginsburg, Ryan Leary, Oleksii Kuchaiev, Jonathan M. Cohen, Huyen Nguyen, and Ravi Teja Gadde, “Jasper: An end-to-end convolutional neural acoustic model,” in INTERSPEECH, 2019, pp. 71–75.
- [29] Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, et al., “Recent developments on ESPnet toolkit boosted by Conformer,” in 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 5874–5878.
- [30] Tom Ko, Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur, “Audio augmentation for speech recognition.,” in INTERSPEECH, 2015, pp. 3586–3589.
- [31] Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin Dogus Cubuk, and Quoc V. Le, “SpecAugment: A simple data augmentation method for automatic speech recognition.,” in INTERSPEECH, 2019, pp. 2613–2617.
- [32] Josef R. Novak, Nobuaki Minematsu, and Keikichi Hirose, “WFST-based grapheme-to-phoneme conversion: Open source tools for alignment, model-building and decoding,” in Proceedings of the 10th International Workshop on Finite State Methods and Natural Language Processing, 2012, pp. 45–49.
- [33] Taku Kudo and John Richardson, “SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 66–71.
- [34] George Saon, Zoltan Tuske, Daniel Bolanos, and Brian Kingsbury, “Advancing RNN transducer technology for speech recognition,” in 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.
- [35] Daniel S. Park, Yu Zhang, Chung-Cheng Chiu, Youzheng Chen, Bo Li, William Chan, Quoc V. Le, and Yonghui Wu, “SpecAugment on large scale datasets,” in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 6879–6883.
- [36] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, vol. 30, pp. 5998–6008.
- [37] J.J. Godfrey, E.C. Holliman, and J. McDaniel, “Switchboard: telephone speech corpus for research and development,” in Proceedings of 1992 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1992, vol. 1, pp. 517–520.
- [38] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210.
- [39] Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor Weber, “Common Voice: A massively-multilingual speech corpus,” in Proceedings of the 12th Language Resources and Evaluation Conference, 2019, pp. 4218–4222.
- [40] Ke Li, Daniel Povey, and Sanjeev Khudanpur, “A parallelizable lattice rescoring strategy with neural language models,” in 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.
- [41] Hossein Hadian, Hossein Sameti, Daniel Povey, and Sanjeev Khudanpur, “Flat-start single-stage discriminatively trained HMM-based models for ASR,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 26, no. 11, pp. 1949–1961, 2018.
- [42] Daniel Povey, Vijayaditya Peddinti, Daniel Galvez, Pegah Ghahremani, Vimal Manohar, Xingyu Na, Yiming Wang, and Sanjeev Khudanpur, “Purely sequence-trained neural networks for ASR based on lattice-free MMI.,” in INTERSPEECH, 2016, pp. 2751–2755.
- [43] Kazuki Irie, Albert Zeyer, Ralf Schlüter, and Hermann Ney, “Language modeling with deep Transformers,” in INTERSPEECH, 2019, pp. 3905–3909.