
End-to-End Automatic Speech Recognition
Integrated With CTC-Based Voice Activity Detection

Abstract

This paper integrates a voice activity detection (VAD) function with end-to-end automatic speech recognition toward an online speech interface and transcribing very long audio recordings. We focus on connectionist temporal classification (CTC) and its extension of CTC/attention architectures. As opposed to an attention-based architecture, input-synchronous label prediction can be performed based on a greedy search with the CTC (pre-)softmax output. This prediction includes consecutive long blank labels, which can be regarded as a non-speech region. We use the labels as a cue for detecting speech segments with simple thresholding. The threshold value is directly related to the length of a non-speech region, which is more intuitive and easier to control than conventional VAD hyperparameters. Experimental results on unsegmented data show that the proposed method outperformed the baseline methods using the conventional energy-based and neural-network-based VAD methods and achieved an RTF less than 0.2. The proposed method is publicly available at https://github.com/espnet/espnet.

Index Terms—  Speech recognition, end-to-end, voice activity detection, streaming, CTC greedy search

1 Introduction

End-to-end automatic speech recognition (E2E-ASR) has been investigated intensively. It directly maps a sequence of acoustic feature vectors to a sequence of graphemes, eliminating the need to build components that require expert knowledge in conventional ASR systems, such as morphological analyzers and pronunciation dictionaries. There are two main types of network architectures for E2E-ASR: connectionist temporal classification (CTC) [1, 2] and the attention mechanism [3, 4]. A joint framework of CTC- and attention-based architectures [5, 6] has recently been proposed to exploit the advantages of both architectures, and its effectiveness has been shown in various contexts [7, 8, 9, 10, 11]. Although the performance of E2E-ASR has improved steadily, most approaches rely on the assumption that the input audio recording is appropriately segmented into short audio pieces. This assumption makes it difficult to use E2E-ASR systems for real-time speech recognition and for transcribing unsegmented long audio archives.

There have been several attempts to solve this problem. Modified attention-based methods [12, 13, 14, 15, 16] restrict the length of the attention window by assuming that the output label sequence is conditioned on a partially observed feature vector sequence rather than the entire feature vector sequence. The recurrent neural network transducer (RNN-T) [17], which is an extended version of CTC, has also been applied to streaming ASR [18, 19] with uni-directional long short-term memory (LSTM) cells. Although their effectiveness has been shown, voice activity detection (VAD) [20, 21], which detects speech segments in an audio sequence, could still help to improve the recognition accuracy by discarding input segments that do not contain speech and to reduce the computational cost of processing non-speech segments.

The simplest VAD scheme is energy-based: if the short-time energy in a frame of the input audio exceeds a pre-determined threshold, the frame is associated with a voice segment. Statistical model-based VAD schemes have also been proposed. In Sohn et al.'s study [22], hidden Markov models (HMMs) are used to evaluate the likelihood of speech at every frame. A Gaussian mixture model (GMM) is used as a classifier that distinguishes voice categories from non-voice categories [23]. Deep neural network (DNN)-based VAD [24, 25, 26] has recently outperformed conventional VAD schemes. VAD is less dependent on language and does not usually require expert knowledge. However, building an extra voice activity detector for ASR requires additional effort. Furthermore, careful tuning of hyperparameters for VAD, such as the speech/non-speech threshold, is often required to maintain a sufficient level of detection accuracy. Integrating VAD with ASR in a unified E2E framework is a promising way to eliminate these problems. There have been some attempts to integrate VAD with conventional HMM-based ASR [27, 28], but the extension to E2E-ASR has not been fully investigated.

This paper attempts to integrate VAD with ASR in an E2E manner. The key idea behind the proposed method is to use the CTC architecture. CTC can perform frame-by-frame label prediction with low computational cost by performing a greedy search. This input-synchronized label sequence, which cannot be easily obtained in the attention architecture, contains useful information for VAD. The proposed method applies a threshold to the number of consecutive blank labels to detect non-speech regions. Since CTC does not always activate at the actual timing of phoneme transitions in speech [1], two additional margins are introduced to avoid inappropriate segmentation caused by this behavior. Although the proposed method is based on heuristic thresholding, the threshold value is directly related to the length of a non-speech region. Therefore, it is quite intuitive and easy to control compared with the hyperparameters of conventional VAD. Furthermore, thanks to this simple strategy, the proposed method requires neither training of an external voice activity detector, a complex training process to achieve reasonable performance [29], nor high computational cost at inference. To support both offline and online speech recognition, we propose two types of inference algorithms using bi-directional and uni-directional architectures. Note that the proposed method can be integrated with the powerful attention-based encoder-decoder within the hybrid CTC/attention framework [5, 6], and we show the effectiveness of the proposed method based on this framework.

2 Integrating voice activity detection with automatic speech recognition

2.1 CTC-based voice activity detection

E2E-ASR inference is generally defined as the problem of finding the most probable grapheme sequence $\hat{C}$ given the audio input $X$:

\hat{C}=\operatorname*{argmax}_{C}\,p\left(C\,|\,X\right), \qquad (1)

where $X=(\mathbf{x}_{1},\ldots,\mathbf{x}_{T})$ is a $T$-length speech feature sequence and $C=(c_{1},c_{2},\ldots)$ is a sequence of grapheme symbols such as alphabetical letters and Kanji characters. To handle very long audio recordings, we assume that $\hat{C}$ consists of the most probable partial grapheme sequences $\{\hat{C}^{(i)}\}_{i=1}^{N}$ given the non-overlapping partial audio subsequences $\{X^{(i)}\}_{i=1}^{N}$:

\hat{C}\simeq\operatorname*{argmax}_{C}\,\prod_{i=1}^{N}p\left(C^{(i)}\,\Big|\,X^{(i)}\right), \qquad (2)

where $N$ is the number of speech segments. The problem of VAD is how to segment $X$ into $X^{(1)},\ldots,X^{(N)}$.
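As a rough illustration of Eq. (2), transcribing a long recording reduces to decoding each detected segment independently and concatenating the partial hypotheses. The sketch below assumes hypothetical `segment_audio` (the VAD) and `decode_segment` (the E2E-ASR inference) callables rather than any actual ESPnet API.

```python
# Minimal sketch of Eq. (2): decode each non-overlapping segment X^(i)
# independently and concatenate the partial hypotheses C^(i).
# `segment_audio` and `decode_segment` are hypothetical stand-ins.

def transcribe_long_recording(features, segment_audio, decode_segment):
    """features: (T, D) acoustic feature sequence of a long recording."""
    hypotheses = []
    for start, end in segment_audio(features):    # yields segment boundaries
        hypotheses.append(decode_segment(features[start:end]))
    return " ".join(hypotheses)                   # joined transcript (sketch)
```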

In this paper, the label sequence obtained by frame-by-frame prediction based on a greedy search with the CTC softmax output is used as a cue to solve this problem. The CTC formulation starts from the posterior probability $p(C\,|\,X)$ introduced in Eq. (1), which is factorized based on the conditional independence assumption as follows:

p\left(C\,|\,X\right)=\sum_{Z\in\mathcal{Z}(C)}p\left(Z\,|\,X\right)\approx\sum_{Z\in\mathcal{Z}(C)}\prod_{t\in\mathcal{T}}p\left(z_{t}\,|\,X\right), \qquad (3)

where $Z$ is a $K$-length CTC state sequence composed of the original grapheme set and the additional blank label, $\mathcal{Z}(C)$ is the set of all possible state sequences given the grapheme sequence $C$, and $\mathcal{T}=\{r,2r,3r,\ldots\}$ is the set of subsampled time indices with the subsampling factor $r$ $(K\times r\simeq T)$. We focus on the frame-level posterior distribution $p(z_{t}\,|\,X)$ in Eq. (3) because it provides frame-synchronous information obtained at an early stage of the inference process. With the greedy search algorithm, the following frame-synchronous label output can be obtained without performing the time-consuming beam search algorithm:

\hat{z}_{t}=\operatorname*{argmax}_{z_{t}}\,p\left(z_{t}\,|\,X\right). \qquad (4)

This operation corresponds to simply taking the label with the highest probability in the categorical distribution of the CTC softmax output at each time step. (The output before applying the softmax function, called the pre-softmax output, can be used instead to save computational cost.) The predicted label $\hat{z}_{t}$ can be the blank label, which often represents non-speech segments, in addition to grapheme symbols and their repetitions; i.e., the output label sequence of CTC is expected to contain helpful information for VAD.
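To make the greedy search of Eq. (4) concrete, the following minimal sketch takes a matrix of frame-level CTC (pre-)softmax outputs and returns the frame-synchronous label sequence; the array shape is an assumption for illustration.

```python
import numpy as np

def ctc_greedy_labels(ctc_outputs: np.ndarray) -> np.ndarray:
    """Greedy search of Eq. (4).

    ctc_outputs: (K, num_labels) frame-level (pre-)softmax outputs,
                 where K is the number of subsampled frames.
    Returns the frame-synchronous label sequence (one label index per frame).
    """
    # argmax is invariant under softmax, so the pre-softmax output suffices.
    return ctc_outputs.argmax(axis=-1)
```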

Figure 1 shows an overview of the proposed method. As shown in the figure, if the number of consecutive blank labels exceeds a pre-determined threshold $V$ (referred to as the minimum blank duration threshold) in the sequence of $\hat{z}_{t}$, the input audio is segmented based on the positions of non-blank labels:

X^{(i)}=\left(\mathbf{x}_{t_{s}^{(i)}},\ \mathbf{x}_{t_{s}^{(i)}+1},\ \ldots,\ \mathbf{x}_{t_{e}^{(i)}}\right), \qquad (5)

where $t_{s}^{(i)}$ denotes the index of the first non-blank label after the $(i-1)$-th segment, and $t_{e}^{(i)}$ denotes the index of the last non-blank label before the consecutive blank labels. Since there is no guarantee that the positions of the non-blank labels $t_{s}^{(i)}$ and $t_{e}^{(i)}$ correspond to the actual beginning/end of phonemes due to the properties of CTC, we introduce two margins as

X^{(i)}=\left(\mathbf{x}_{t_{s}^{(i)}-rm_{s}},\ \ldots,\ \mathbf{x}_{t_{e}^{(i)}+rm_{e}}\right), \qquad (6)

where $m_{s}$ and $m_{e}$ are referred to as the onset margin and offset margin, respectively, and $r$ is the subsampling factor described above.
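The segmentation rule of Eqs. (5) and (6) can be written as a short scan over the greedy label sequence. The sketch below is a minimal reading of the rule, assuming blank index 0 and expressing the margins in subsampled frames; it is not taken from the released implementation.

```python
def ctc_vad_segments(z_hat, V, m_s, m_e, r, num_frames, blank=0):
    """Sketch of Eqs. (5)-(6): split the input into segments X^(i).

    z_hat      : frame-synchronous greedy labels (subsampled by r)
    V          : minimum blank duration threshold (in subsampled frames)
    m_s, m_e   : onset / offset margins (in subsampled frames)
    r          : encoder subsampling factor
    num_frames : T, number of original feature frames
    Returns a list of (start, end) index pairs into the feature sequence.
    """
    segments, seg_start, last_nonblank, n_blank = [], None, None, 0
    for k, label in enumerate(z_hat):
        if label == blank:
            n_blank += 1
            # A blank run longer than V closes the current segment.
            if seg_start is not None and n_blank > V:
                segments.append((seg_start, last_nonblank))
                seg_start = None
        else:
            n_blank, last_nonblank = 0, k
            if seg_start is None:
                seg_start = k
    if seg_start is not None:                      # trailing speech segment
        segments.append((seg_start, last_nonblank))
    # Map subsampled indices back to feature frames and add the margins.
    return [(max(0, r * s - r * m_s), min(num_frames, r * e + r * m_e + 1))
            for s, e in segments]
```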

Fig. 1: Example of detected voice segments in the proposed method, where the subsampling factor $r=2$, minimum blank duration threshold $V=4$, onset margin $m_{s}=1$, and offset margin $m_{e}=2$. The shaded acoustic feature vectors are used for predicting the final label sequence.

2.2 Offline and online inference

Bi-directional LSTM cells, rather than uni-directional ones, are typically used in the CTC encoder. The inference algorithm of the proposed method using the bi-directional encoder consists of two stages: in the first stage, a mixed blank and non-blank label sequence is obtained by the greedy search given the entire audio input sequence, and in the second stage, the final label sequence is obtained by feeding in the partial audio input sequences extracted using the mixed label sequence. This inference algorithm is suitable for offline recognition, e.g., transcribing long audio recordings, because of the inevitable delay caused by the two-stage decoding. On the other hand, online decoding can be performed with the uni-directional encoder because, unlike in the first algorithm, there is no need to recompute the encoder states. The algorithmic delay mainly depends on the length of the offset margin, which seems reasonable for practical use.
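For the online case, the loop below sketches how the uni-directional encoder, the greedy CTC labels, and the segmentation rule fit together; `encoder_step`, `ctc_greedy`, and `decode_segment` are hypothetical callables (the actual ESPnet implementation may differ), and margin handling is omitted for brevity.

```python
def online_recognition(feature_blocks, encoder_step, ctc_greedy,
                       decode_segment, V, blank=0):
    """Streaming sketch: encode each incoming block once (uni-directional,
    so past states are never recomputed), watch the greedy CTC labels, and
    decode a segment as soon as a blank run longer than V closes it."""
    enc_segment, n_blank, in_speech, results = [], 0, False, []
    for block in feature_blocks:
        enc = encoder_step(block)                 # incremental encoding
        for state, label in zip(enc, ctc_greedy(enc)):
            enc_segment.append(state)
            if label != blank:
                n_blank, in_speech = 0, True
            else:
                n_blank += 1
                if in_speech and n_blank > V:     # end of a speech segment
                    results.append(decode_segment(enc_segment))
                    enc_segment, in_speech = [], False
    if in_speech:                                 # flush the last segment
        results.append(decode_segment(enc_segment))
    return results
```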

Although the proposed method depends on CTC, it can be combined with the attention decoder within the hybrid CTC/attention framework [5]. In this framework, the encoders for both CTC and attention models are shared and trained based on their two objective functions. The decoders for both CTC and attention models can be used for decoding simultaneously, resulting in better recognition performance.

3 Experiments

Two datasets were used to investigate the effectiveness of the proposed method. Note that the datasets consist of spontaneous talks that sometimes exceed 20 minutes in length, so audio segmentation is essential for keeping computational cost and complexity manageable. ESPnet [30] was used as the E2E-ASR toolkit throughout the experiments. The following four methods were compared:

  • Oracle: The audio input of the E2E model was segmented according to the time information provided by the datasets.

  • Base1: The audio input of the E2E model was automatically segmented by the simple energy-based VAD algorithm implemented in the Kaldi toolkit (https://github.com/kaldi-asr/kaldi). The hyperparameters of the algorithm, such as the energy threshold, were tuned by a grid search.

  • Base2: The audio input of the E2E model was automatically segmented by a data-driven neural-network-based VAD. We built a voice activity detector that is robust across different corpora, trained on large amounts of speech segments with noisy and reverberant data augmentation. The network consisted of five time-delay neural network layers and two statistics pooling layers, with 256 hidden units per layer. The input of the network was 40-dimensional MFCCs with around 800 ms of left context and 200 ms of right context. Simple Viterbi decoding on an HMM was used to obtain speech activity.

  • Prop: The audio input of the E2E model was automatically segmented based on the proposed method described in the previous section.

The E2E model was mainly based on the hybrid CTC/attention architecture and trained on the segmented data provided by the datasets.

3.1 Japanese dataset

The CSJ corpus [31] was used for our first evaluation. It contains about 650 hours of spontaneous Japanese speech data. The evaluation data is composed of three sets: eval1, eval2, and eval3. The training hyperparameters were set to the default values of the training recipe provided by ESPnet, except that the mini-batch size and the number of units were half the default values. Assuming offline recognition, the network structure consisted of a 4-layer bi-directional LSTM encoder and a 1-layer LSTM decoder. The subsampling factor $r$ introduced in Section 2.1 was set to four. The input feature was composed of 80-dimensional mel-scale filter-bank features and 3-dimensional pitch features, calculated every 10 ms. The output of the E2E model included Japanese Kanji and Hiragana characters. To avoid recognition results that are very short relative to the input audio length, caused by the attention decoder predicting the end-of-sentence symbol too early, recognition candidates satisfying the following condition were rejected:

\frac{\mbox{Length of output label sequence}}{\mbox{Length of subsampled encoded sequence}}\leq\alpha, \qquad (7)

where $\alpha$ was set to 0.1 in the experiments. This scheme was important for avoiding the effect of this unintended behavior of the attention mechanism. Character error rate (CER) was used as the evaluation metric because CER is widely used for Japanese ASR evaluation due to the ambiguity of word boundaries in Japanese.
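The rejection rule of Eq. (7) amounts to a single ratio test; a minimal check, with the hypothetical helper name and the default alpha of 0.1 used here only for illustration, is:

```python
def is_rejected(hyp_len: int, enc_len: int, alpha: float = 0.1) -> bool:
    """Eq. (7): reject a candidate whose label sequence is too short
    relative to the subsampled encoder output length."""
    return hyp_len / enc_len <= alpha
```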

3.1.1 Decoding hyperparameters

The initial experiment used only the CTC decoding instead of the joint CTC/attention decoding. We first investigated the effect of the onset margin $m_{s}$ introduced in Section 2.1. It was varied from 0 to 3 while the minimum blank duration threshold $V=16$ and the offset margin $m_{e}=2$ were fixed. Table 1 summarizes the CERs obtained in this experiment. The table shows that the onset margin is important for achieving correct recognition results. The value of $m_{e}$ was also varied from 0 to 3 while $V=16$ and $m_{s}=2$ were fixed. The results are shown in Table 2. The performance difference between with and without the offset margin ($m_{e}>0$ vs. $m_{e}=0$) was considerably large. These two margins seem to absorb the time lag of the CTC softmax output.

Table 1: Effect of value of onset margin.
$m_{s}$ eval1 [%] eval2 [%] eval3 [%] Avg. [%]
0 10.4 7.2 8.6 8.7
1 10.1 7.0 8.1 8.4
2 10.1 7.0 8.0 8.4
3 10.1 7.1 8.1 8.4

Table 2: Effect of value of offset margin.
$m_{e}$ eval1 [%] eval2 [%] eval3 [%] Avg. [%]
0 10.7 7.5 8.9 9.0
1 10.2 7.1 8.3 8.5
2 10.1 7.0 8.0 8.4
3 10.0 6.9 7.9 8.3

Table 3: Effect of value of minimum blank duration threshold.
$V$ eval1 [%] eval2 [%] eval3 [%] Avg. [%]
8 11.2 8.2 8.9 9.7
12 10.5 7.3 8.3 8.7
16 10.1 7.0 8.0 8.4
20 10.0 6.8 7.7 8.2
24 9.9 6.7 6.7 7.8

To investigate the sensitivity to changes in the minimum blank duration threshold, the value of $V$ was varied with fixed onset and offset margins ($m_{s}=2$ and $m_{e}=2$). Table 3 shows the results on all the test sets. As expected, a low CER can be obtained by selecting a large threshold value. However, reasonable recognition performance could already be achieved with a medium threshold value, e.g., $V=16$, which corresponds to a threshold of $640~(=16\times 4\times 10)$ ms. This duration seems reasonable for detecting short pauses, indicating that the proposed method allows intuitive tuning of the VAD parameters, in contrast to conventional VAD methods that require non-intuitive parameters such as an energy threshold.
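Because the threshold is expressed in subsampled frames, it translates directly into a pause duration; a small helper (hypothetical, using the frame shift of 10 ms and subsampling factor of 4 from this setup) makes the mapping explicit:

```python
def blank_threshold_ms(V: int, r: int = 4, frame_shift_ms: int = 10) -> int:
    """Pause duration corresponding to the minimum blank duration threshold."""
    return V * r * frame_shift_ms

print(blank_threshold_ms(16))  # 640 ms, the value discussed above
```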

3.1.2 Performance comparison

In the next experiment, we compared the baseline methods with the proposed one, whose hyperparameters were set to $V=16$, $m_{s}=2$, and $m_{e}=3$. Table 4 shows the results of the experiments using not only the CTC decoder but also the attention decoder and the joint CTC/attention decoder. Base2 obtained a lower CER than Base1, which indicates the effectiveness of the data-driven VAD method. Prop achieved a further improvement over Base2. CTC is a mapping from acoustic features to characters, whereas standard data-driven VAD is a mapping from acoustic features to a binary speech/non-speech label. The performance gain may come from the proposed method detecting speech while considering the linguistic information provided by CTC. Prop obtained the best CER regardless of the decoder type. In the following experiments, the joint decoding using the CTC prefix score was used.

3.2 English dataset

The TED-LIUMv2 corpus [32], which is a set of English TED talks with transcriptions, was used for our second evaluation. It contains about 200 hours of speech data. The training hyperparameters were set to the default values of the training recipe provided by ESPnet. Two types of network structure were investigated: a 6-layer bi-directional LSTM encoder with a 1-layer LSTM decoder, and a 6-layer uni-directional LSTM encoder with a 2-layer LSTM decoder. The output of the E2E model consisted of subword units obtained by byte-pair encoding (BPE) instead of alphabetical letters. A 4-layer LSTM-based language model was trained on the same corpus and used at the decoding stage. Word error rate (WER) as well as CER were used as the evaluation metrics.

3.2.1 Decoding hyperparameters

In the first experiment, the effects of the onset and offset margins were investigated. The minimum blank duration threshold was fixed at $V=16$, following the result of the previous evaluation. Tables 5 and 6 show the experimental results for the bi-directional and uni-directional cases, respectively. For the bi-directional case, a large offset margin was important, whereas a large onset margin was important for the uni-directional case. The appropriate values of the two margins differed from those of the previous evaluation, possibly because of the difference in the type of subword units.

Table 4: Performance comparison on CSJ corpus.
CTC [%] Attention [%] Joint [%]
Oracle 7.2 7.7 6.1
Base1 8.9 9.4 8.6
Base2 8.6 8.9 8.1
Prop 9.2 8.3 7.6
Table 5: Effect of decoding hyperparameters (bi-directional).
$m_{s}$ $m_{e}$ dev WER [%] test WER [%]
4 2 15.2 19.5
4 6 13.2 16.6
4 10 12.2 15.4
2 10 12.3 15.6
Table 6: Effect of decoding hyperparameters (uni-directional).
$m_{s}$ $m_{e}$ dev WER [%] test WER [%]
2 4 20.9 27.0
6 4 16.0 20.4
10 4 14.9 19.0
10 2 14.9 18.9

3.2.2 Performance comparison

In the next experiment, the proposed method was compared with the baseline methods, where the hyperparameters of Prop were set to $V=16$, $m_{s}=4$, and $m_{e}=10$ for the bi-directional case and $V=16$, $m_{s}=10$, and $m_{e}=2$ for the uni-directional case. Tables 7 and 8 summarize the obtained CERs and WERs. The proposed method again outperformed the baseline methods in both cases, indicating its effectiveness.

3.2.3 Inference speed

In the final experiment, real-time factors (RTFs) of the proposed method with the uni-directional encoder were measured under several conditions. To speed up inference, the vectorized beam search algorithm [33] was used, with its margin parameter set to 50. A CPU with four threads or a single GPU (NVIDIA TITAN V) was used. The obtained RTFs as well as WERs are shown in Table 9. The vectorized beam search improved the RTF with only a small accuracy degradation. As expected, the GPU accelerated the inference significantly, and an RTF of 0.18 was achieved. This indicates that the proposed method is acceptable for real-time applications.
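For reference, the RTF reported here follows the usual definition of wall-clock decoding time divided by the duration of the processed audio; the sketch below, with a placeholder `decode` callable, shows how such a number can be measured:

```python
import time

def real_time_factor(decode, audio_seconds: float) -> float:
    """RTF = decoding wall-clock time / audio duration
    (RTF < 1 means faster than real time)."""
    start = time.perf_counter()
    decode()                      # run recognition on the audio once
    return (time.perf_counter() - start) / audio_seconds
```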

4 Conclusion

Automatic speech recognition and voice activity detection were integrated in an end-to-end manner. Our experiments showed the effectiveness of the proposed method over conventional voice activity detectors. Future work includes applying the proposed idea to the Transformer, a promising model that can outperform the LSTM encoder-decoder model [34].

5 Acknowledgements

This research is supported by the Center of Innovation (COI) program from the Japan Science and Technology Agency (JST). The authors thank Vimal Manohar for kindly providing us with a DNN-based voice activity detector.

Table 7: Performance comparison on TED-LIUM corpus (bi-directional).
dev CER / WER [%] test CER / WER [%]
Oracle 9.6 / 11.7 9.4 / 11.2
Base1 11.2 / 13.0 16.6 / 17.6
Base2 10.7 / 12.9 14.9 / 17.2
Prop 10.0 / 12.2 13.3 / 15.4
Table 8: Performance comparison on TED-LIUM corpus (uni-directional).
dev CER / WER [%] test CER / WER [%]
Oracle 11.9 / 14.1 13.0 / 14.9
Base1 14.0 / 15.7 20.2 / 20.3
Base2 13.4 / 15.9 18.8 / 20.7
Prop 12.5 / 14.9 16.8 / 18.9
Table 9: Speed improvement of online inference.
Vectorization GPU dev WER [%] test WER [%] RTF
– – 14.9 18.9 3.77
✓ – 15.0 19.0 2.93
– ✓ 14.9 18.9 0.27
✓ ✓ 15.0 19.0 0.18

References

  • [1] Alex Graves, Santiago Fernandez, Faustino Gomez, and Jurgen Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” Proceedings of the 23rd ICML, pp. 369–376, 2006.
  • [2] Alex Graves and Navdeep Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” Proceedings of the 31st ICML, pp. 1764–1772, 2014.
  • [3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
  • [4] Jan Chorowski, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, “End-to-end continuous speech recognition using attention-based recurrent NN: First results,” arXiv preprint arXiv:1412.1602, 2014.
  • [5] Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R. Hershey, and Tomoki Hayashi, “Hybrid CTC/attention architecture for end-to-end speech recognition,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240–1253, 2017.
  • [6] Takaaki Hori, Shinji Watanabe, Yu Zhang, and William Chan, “Advances in joint CTC-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM,” Proceedings of Interspeech, pp. 949–953, 2017.
  • [7] Albert Zeyer, Kazuki Irie, Ralf Schluter, and Hermann Ney, “Improved training of end-to-end attention models for speech recognition,” Proceedings of Interspeech, pp. 7–11, 2018.
  • [8] Suyoun Kim and Florian Metze, “Dialog-context aware end-to-end speech recognition,” Proceedings of SLT, pp. 434–440, 2018.
  • [9] Stavros Petridis, Themos Stafylakis, Pingchuan Ma, Georgios Tzimiropoulos, and Maja Pantic, “Audio-visual speech recognition with a hybrid CTC/attention architecture,” Proceedings of SLT, pp. 513–520, 2018.
  • [10] Wangyou Zhang, Xuankai Chang, and Yanmin Qian, “Knowledge distillation for end-to-end monaural multi-talker ASR system,” Proceedings of Interspeech, pp. 2633–2637, 2019.
  • [11] Karan Malhotra, Shubham Bansal, and Sriram Ganapathy, “Active learning methods for low resource end-to-end speech recognition,” Proceedings of Interspeech, pp. 2215–2219, 2019.
  • [12] Navdeep Jaitly, Quoc V Le, Oriol Vinyals, Ilya Sutskever, David Sussillo, and Samy Bengio, “An online sequence-to-sequence model using partial conditioning,” Advances in Neural Information Processing Systems 29, pp. 5067–5075, 2016.
  • [13] Junfeng Hou, Shiliang Zhang, and Lirong Dai, “Gaussian prediction based attention for online end-to-end speech recognition,” Proceedings of Interspeech, pp. 3692–3696, 2017.
  • [14] Ruchao Fan, Pan Zhou, Wei Chen, Jia Jia, and Gang Liu, “An online attention-based model for speech recognition,” Proceedings of Interspeech, pp. 4390–4394, 2019.
  • [15] Haoran Miao, Gaofeng Cheng, Pengyuan Zhang, Ta Li, and Yonghong Yan, “Online hybrid CTC/attention architecture for end-to-end speech recognition,” Proceedings of Interspeech, pp. 2623–2627, 2019.
  • [16] Niko Moritz, Takaaki Hori, and Jonathan Le Roux, “Triggered attention for end-to-end speech recognition,” Proceedings of ICASSP, pp. 5666–5670, 2019.
  • [17] Alex Graves, “Sequence transduction with recurrent neural networks,” arXiv preprint arXiv:1211.3711, 2012.
  • [18] Kanishka Rao, Hasim Sak, and Rohit Prabhavalkar, “Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer,” Proceedings of ASRU, pp. 193–199, 2017.
  • [19] Yanzhang He et al., “Streaming end-to-end speech recognition for mobile devices,” arXiv preprint arXiv:1811.06621, 2018.
  • [20] Bishnu S. Atal and Lawrence R. Rabiner, “A pattern recognition approach to voiced-unvoiced-silence classification with applications to speech recognition,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 3, pp. 201–212, 1976.
  • [21] Javier Ramirez, Juan M. Gorriz, and Jose C. Segura, “Voice activity detection. Fundamentals and speech recognition system robustness,” Robust Speech Recognition and Understanding, 2007.
  • [22] Jongseo Sohn, Nam Soo Kim, and Wonyong Sung, “A statistical model-based voice activity detection,” IEEE Signal Processing Letters, vol. 6, no. 1, pp. 1–3, 1999.
  • [23] Akinobu Lee, Keisuke Nakamura, Ryuichi Nisimura, Hiroshi Saruwatari, and Kiyohiro Shikano, “Noise robust real world spoken dialogue system using GMM based rejection of unintended inputs,” Proceedings of the 8th ICSLP, pp. 173–176, 2004.
  • [24] Xiao-Lei Zhang and Ji Wu, “Deep belief networks based voice activity detection,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 4, pp. 697–710, 2013.
  • [25] Neville Ryant, Mark Liberman, and Jiahong Yuan, “Speech activity detection on YouTube using deep neural networks,” Proceedings of Interspeech, pp. 728–731, 2013.
  • [26] Thad Hughes and Keir Mierle, “Recurrent neural networks for voice activity detection,” Proceedings of ICASSP, pp. 7378–7382, 2013.
  • [27] Tasuku Oonishi, Paul R. Dixon, Koji Iwano, and Sadaoki Furui, “Robust speech recognition using VAD-measure-embedded decoder,” Proceedings of Interspeech, pp. 2239–2242, 2009.
  • [28] Kit Thambiratnam, Weiwu Zhu, and Frank Seide, “Voice activity detection using speech recognizer feedback,” Proceedings of Interspeech, pp. 1492–1495, 2012.
  • [29] Senmao Wang, Pan Zhou, Wei Chen, Jia Jia, and Lei Xie, “Exploring RNN-transducer for Chinese speech recognition,” arXiv preprint arXiv:1811.05097, 2018.
  • [30] Shinji Watanabe et al., “ESPnet: End-to-end speech processing toolkit,” Proceedings of Interspeech, pp. 2207–2211, 2018.
  • [31] Kikuo Maekawa, “Corpus of spontaneous Japanese: Its design and evaluation,” Proceedings of SSPR, pp. 7–12, 2003.
  • [32] Anthony Rousseau, Paul Deleglise, and Yannick Esteve, “Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks,” Proceedings of the 9th LREC, 2014.
  • [33] Hiroshi Seki, Takaaki Hori, Shinji Watanabe, Niko Moritz, and Jonathan Le Roux, “Vectorized beam search for CTC-attention-based speech recognition,” Proceedings of Interspeech, pp. 3825–3829, 2019.
  • [34] Shigeki Karita et al., “A comparative study on transformer vs RNN in speech applications,” Proceedings of ASRU, 2019.