
The HW-TSC’s Offline Speech Translation Systems for IWSLT 2021 Evaluation

Minghan Wang1, Yuxia Wang1, Chang Su1, Jiaxin Guo1, Yingtao Zhang1, Yujia Liu1,
Min Zhang1, Shimin Tao1, Xingshan Zeng2, Liangyou Li2, Hao Yang1, Ying Qin1
1Huawei Translation Services Center
2Huawei Noah’s Ark Lab
{wangminghan,wangyuxia5,suchang8,guojiaxin1,zhangyingtao9,
liuyujia13,zhangmin186,taoshimin,zeng.xingshan,
liliangyou,yanghao30,qinying}@huawei.com
Abstract

This paper describes our submission to the IWSLT 2021 offline speech translation task. Our system is built in a cascade form, including a speaker diarization module, an Automatic Speech Recognition (ASR) module and a Machine Translation (MT) module. We directly use the LIUM_SpkDiarization tool as the diarization module. The ASR module is trained on three ASR datasets from different sources via multi-source training, using a modified Transformer encoder. The MT module is pretrained on the large-scale WMT news translation dataset and fine-tuned on the TED corpus. Our method achieves a BLEU score of 24.6 on the 2021 test set.

1 Introduction

A speech translation (ST) system aims to translate speech in a source language into text in a target language. There are two types of ST systems: cascade and end-to-end. A cascade system consists of several sequentially connected sub-systems that are trained independently; the output of the first module is the input of the second, and so on. By contrast, end-to-end systems generate the target text directly from the audio signal, without intermediate outputs, and all parameters are updated jointly (Bentivogli et al., 2021).

In the IWSLT 2021 offline speech translation task, participants are welcome to build either of the two types. Participants can build their system using only the data provided by the organizers, which is considered the constrained track, or additionally use publicly available external resources, which is considered the unconstrained track. The test set is extracted from TED talks, translating English speech into German text, similar to the setup of the last few years (Anastasopoulos et al., 2021).

Although end-to-end ST systems avoid error propagation, we opt for a cascade system due to its overwhelming advantage in the richness of available training sources. Our system is composed of three modules: a speaker diarization module, for which we apply the off-the-shelf tool LIUM_SpkDiarization (Meignier and Merlin, 2010), and the ASR and MT modules, which we train ourselves.

The paper is organized as follows: Section 2 introduces the datasets used for training. Section 3 describes the details of each module. The experimental setup and results are presented in Section 4, and Section 5 concludes.

2 Data

     Corpora          Size   Time (hr)
ASR  LibriSpeech      281K   960
     MuST-C V2        248K   435
     CoVoST           564K   900
MT   WMT bitext       27.7M  -
     Back-translated  521M   -
     TED corpus       209K   -

Table 1: Statistics of the corpora used for ASR and MT training.

We use LibriSpeech (Panayotov et al., 2015), MuST-C V2 (Cattoni et al., 2021) and CoVoST (Wang et al., 2020) to train the ASR module. The audio of LibriSpeech is derived from readings of audiobooks, and its text is uncased and contains no punctuation. MuST-C is a multilingual dataset recorded from TED talks; we use only its English data for the ASR task. CoVoST is also a multilingual speech translation dataset, based on Common Voice, with open-domain content. Both MuST-C and CoVoST have case-sensitive text with punctuation.

To pretrain the MT module, we employ all available bilingual text, as well as the News Crawl monolingual text, provided in the WMT 2019 news translation task (Barrault et al., 2019), followed by fine-tuning on the TED corpus (https://wit3.fbk.eu/2017-01-c). Table 1 summarizes the statistics of these corpora.

The development and test sets are provided on the official site and cover the years 2010 to 2021. Although a pre-segmented version is offered, we do not use it, since we found its segmentation quality insufficient to produce good transcripts.

3 Method

3.1 Speaker Diarization

The audio files provided in the development and test sets are recordings of complete TED talks. To obtain sentence-level segmentation, LIUM_SpkDiarization is run with its default parameter configuration, followed by the removal of audio segments shorter than one second.
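To make this step concrete, the sketch below scripts the two operations. The Java invocation follows the tool's published quickstart as we recall it, and the .seg parsing assumes the common layout in which the third and fourth fields are the segment start and length in 10 ms frames; both should be treated as illustrative rather than our exact command.

```python
import subprocess

def diarize(wav_path, seg_path, show_name, jar="LIUM_SpkDiarization.jar"):
    """Run LIUM_SpkDiarization with default parameters (quickstart-style call)."""
    subprocess.run(
        ["java", "-Xmx2g", "-jar", jar,
         f"--fInputMask={wav_path}", f"--sOutputMask={seg_path}",
         "--doCEClustering", show_name],
        check=True,
    )

def read_segments(seg_path, min_seconds=1.0):
    """Parse the .seg output and drop segments shorter than min_seconds.
    Assumes fields[2]/fields[3] are start/length in 10 ms frames."""
    segments = []
    with open(seg_path) as f:
        for line in f:
            if line.startswith(";;") or not line.strip():
                continue  # skip comment and blank lines
            fields = line.split()
            start, length = int(fields[2]), int(fields[3])
            if length * 0.01 >= min_seconds:
                segments.append((start * 0.01, (start + length) * 0.01))
    return segments
```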

3.2 Automatic Speech Recognition

Our ASR model is built upon the Transformer architecture (Vaswani et al., 2017), but in the encoder, to better extract audio rather than text features, we replace the original word embedding layer with two 1-dimensional convolution layers, both with kernel size 5 (Synnaeve et al., 2019).
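A minimal PyTorch sketch of this front-end is given below. The kernel size of 5 comes from the text; the stride of 2 (for time-axis down-sampling) and the ReLU activation are our assumptions, as they are not fixed in the paper.

```python
import torch.nn as nn

class ConvFrontend(nn.Module):
    """Two 1-D convolutions in place of the word embedding layer
    of a standard Transformer encoder (sketch, not the exact model)."""

    def __init__(self, n_mels=80, d_model=1024):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
        )

    def forward(self, feats):                  # feats: (batch, frames, n_mels)
        x = self.conv(feats.transpose(1, 2))   # convolve over the time axis
        return x.transpose(1, 2)               # (batch, frames', d_model)
```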

We find that the IWSLT test set, in terms of content domain and writing style, is closer to MuST-C than to the other two corpora mentioned in Section 2, but MuST-C is relatively small in scale. We therefore expect the model to learn domain- and style-invariant ASR knowledge from the distributionally distant LibriSpeech and CoVoST, while still being able to decode in the specific style learned from MuST-C. To this end, we propose to train the model with multi-source training:

Multi-Source Training

This approach explicitly provides the data source information as a prior condition to the model, so that the distributions in the text space can be modeled separately for each source. It is commonly used in multilingual translation (Ha et al., 2016; Johnson et al., 2017), where a language tag is fed as the first token to the decoder to steer the model to translate into a specific language.

In our ASR model, we use [LS], [MC] and [CV] to represent LibriSpeech, MuST-C and CoVoST, respectively. During training, these tags substitute for the [BOS] token as the first decoder token on samples from the corresponding data source, providing the model with an explicit signal to generate text in the style indicated by the tag. Formally, we define $s \in \mathcal{S}$ as the data source tag and integrate it into the original objective:

$p(y_i \mid y_{<i}, s, X)$

where $X$ denotes the source tokens, $y_i$ is the target token at the current step, and $y_{<i}$ are the previously generated tokens under the autoregressive setting. In this way, the probability of predicting the next token is additionally conditioned on the data source.
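For completeness, over a full target sequence $Y = (y_1, \dots, y_{|Y|})$ this conditional corresponds to the standard negative log-likelihood training objective, now additionally conditioned on the source tag:

$\mathcal{L}(\theta) = -\sum_{i=1}^{|Y|} \log p(y_i \mid y_{<i}, s, X; \theta)$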

During inference, we force the model to decode in the style of MuST-C by feeding [MC] as the initial token, since the IWSLT test set comes from the same data source as MuST-C (TED talks). This method expands the size of the training set while preventing the model from decoding in a mixture of styles.
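The sketch below illustrates both sides of the mechanism: tag substitution during training and tag-forced decoding at inference. The model interface and the greedy loop are illustrative only; our actual system is implemented in fairseq.

```python
import torch

# Illustrative tag strings; in practice they are added to the target vocabulary.
TAGS = {"librispeech": "[LS]", "mustc": "[MC]", "covost": "[CV]"}

def make_target(tokens, source, vocab):
    """Training side: replace [BOS] with the data-source tag."""
    return [vocab[TAGS[source]]] + tokens + [vocab["[EOS]"]]

@torch.no_grad()
def decode(model, audio_features, vocab, max_len=200):
    """Inference side: greedy decoding forced into MuST-C style via [MC]."""
    ys = [vocab["[MC]"]]  # the tag takes the place of [BOS]
    for _ in range(max_len):
        logits = model(audio_features, torch.tensor([ys]))  # assumed interface
        next_id = logits[0, -1].argmax().item()
        if next_id == vocab["[EOS]"]:
            break
        ys.append(next_id)
    return ys[1:]  # strip the tag
```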

3.3 Machine Translation

For the machine translation model, we strictly follow Ng et al. (2019) and pretrain on the WMT 2019 news translation corpus, including bilingual text and data back-translated from monolingual text. We then fine-tune on the TED corpus for domain adaptation.

4 Experiments

SET      BLEU       TER    BEER   CharacTER  BLEU(ci)  TER(ci)
dev2010  26.00      58.85  53.22  48.52      27.56     56.42
tst2010  26.37      58.72  52.21  51.45      27.95     56.27
tst2013  29.89      55.97  53.59  47.80      31.33     53.84
tst2014  28.03      57.56  52.98  48.78      29.17     55.61
tst2015  23.20      74.90  50.59  51.63      24.37     72.84
tst2018  22.13      70.26  51.58  52.44      23.43     67.97
tst2020  25.40      -      -      -          -         -
tst2021  20.3/24.6  -      -      -          -         -

Table 2: Experimental results of our system. Results from dev2010 to tst2018 are from our own evaluation. Scores for tst2020 and tst2021 are from the official report, where tst2020 uses TEDRef and tst2021 reports TEDRef (left) and NewRef (right).

4.1 Setup

80-dimensional Mel filter bank features are extracted from the audio files of the ASR training corpora. SentencePiece (Kudo and Richardson, 2018) is used for tokenization of the ASR text, with a learned vocabulary restricted to 20,000 sub-tokens. For the MT datasets, we apply the Moses tokenizer (Koehn et al., 2007) and BPE (Sennrich et al., 2016) for tokenization.
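As a sketch, this preprocessing can be reproduced roughly as follows, e.g. with torchaudio's Kaldi-compatible filter bank and the SentencePiece trainer; file names are illustrative and this is not our exact pipeline.

```python
import torchaudio
import sentencepiece as spm

# Feature extraction: 80-dim log-Mel filter bank features.
waveform, sample_rate = torchaudio.load("utterance.wav")
fbank = torchaudio.compliance.kaldi.fbank(
    waveform, num_mel_bins=80, sample_frequency=sample_rate
)  # shape: (num_frames, 80)

# Subword model for the ASR transcripts, capped at 20,000 sub-tokens.
spm.SentencePieceTrainer.train(
    input="asr_text.txt", model_prefix="asr_spm", vocab_size=20000
)
```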

The ASR model is configured as: $n_{\text{encoder\_layers}} = 12$, $n_{\text{decoder\_layers}} = 6$, $n_{\text{heads}} = 16$, $d_{\text{hidden}} = 1024$, $d_{\text{FFN}} = 4096$. The NMT model has the standard Transformer-big configuration but with $d_{\text{FFN}}$ set to 8192 (Ng et al., 2019). All models are implemented with fairseq (Ott et al., 2019).

During ASR training, we set the batch size to a maximum of 40,000 frames per card. The inverse square-root schedule is used for learning-rate scheduling, with warm-up steps set to 10,000 and the peak learning rate set to 5e-4. Adam is used as the optimizer. The model is trained on 4 V100 GPUs for 50 epochs, and the parameters of the last 4 epochs are averaged. All audio inputs are augmented with SpecAugment (Park et al., 2019) and normalized with utterance-level cepstral mean and variance normalization (CMVN).
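For reference, the sketch below spells out two of these components: the inverse square-root schedule (mirroring fairseq's inverse_sqrt scheduler, assuming a zero initial learning rate) and utterance-level CMVN.

```python
import numpy as np

def inverse_sqrt_lr(step, peak_lr=5e-4, warmup=10_000):
    """Linear warm-up to peak_lr, then inverse square-root decay."""
    if step <= warmup:
        return peak_lr * step / warmup
    return peak_lr * (warmup / step) ** 0.5

def utterance_cmvn(feats: np.ndarray) -> np.ndarray:
    """Per-utterance cepstral mean and variance normalization.
    feats: (num_frames, num_bins) filter-bank features."""
    mean = feats.mean(axis=0)
    std = feats.std(axis=0)
    return (feats - mean) / (std + 1e-8)
```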

The pretraining of the NMT model strictly follows Ng et al. (2019). We pretrain two models from different random initializations. Both are fine-tuned on the TED corpus for 10,000 steps with 32,768 tokens per batch on 4 V100 GPUs. Adam is used for optimization, with the learning rate set to 1e-5. For each model, the parameters of the last 4 epochs are averaged to obtain the final parameters. During decoding, the two models are ensembled to generate better translations.
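Checkpoint averaging can be sketched as below, assuming fairseq's checkpoint layout in which the weights live under the "model" key; fairseq also ships an equivalent script (scripts/average_checkpoints.py).

```python
import torch

def average_checkpoints(paths):
    """Average the parameters of several checkpoints (e.g. the last 4 epochs)."""
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")["model"]
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    return {k: v / len(paths) for k, v in avg.items()}
```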

We use the toolkit from SLT.KIT (https://github.com/jniehues-kit/SLT.KIT) for evaluation on all development sets, which produces metrics including BLEU (Papineni et al., 2002), TER (Snover et al., 2006), BEER (Stanojevic and Sima’an, 2014) and CharacTER (Wang et al., 2016).

                Independent  Hybrid
LS test-clean   3.32         3.57
MC test-COMMON  17.23        15.64
CV test         31.08        29.25

Table 3: WER (word error rate) of the ASR model evaluated on the test sets of the three datasets, trained independently or on the hybrid of sources. LS, MC and CV stand for LibriSpeech, MuST-C and CoVoST.

4.2 Results

As shown in Table 2, our system obtains consistently strong results on the development and test sets across the years, except for tst2018, which contains speech from lectures and thus deviates distributionally from TED talks, leading to lower performance. For tst2020 and tst2021, we report only the BLEU scores from the official report, because the references have not been published yet.

SET      Pretrain  Fine-tune
dev2010  31.0      33.1
tst2010  32.7      35.2
tst2013  35.3      37.8
tst2014  31.3      33.6
tst2015  33.7      36.4
tst2018  29.9      32.1

Table 4: BLEU scores of the MT model evaluated on the gold segmentation of the development sets, before (Pretrain) and after (Fine-tune) fine-tuning.

We further perform an ablation study on the ASR model to evaluate the influence of multi-source training, i.e., training on the hybrid of different sources with explicit tags. Specifically, we compare models trained on the three corpora independently with a model of the same architecture trained by multi-source training. Note that each configuration is evaluated on the test set of the same source as its training data, rather than on a single common benchmark.

Table 3 shows that, except for LibriSpeech (LS), the performance on the other two test sets improves significantly with multi-source training. We speculate that MC and CV mutually augment each other due to their similar data distributions, while the large domain gap hurts LS: most sentences in MC and CV are spoken language, case-sensitive and punctuated, whereas LS has overall better quality, which may be attributed to its written-language style.

We further analyze the NMT model by evaluating on the gold segmentation of all development sets. Table 4 shows the performance of the NMT model before and after fine-tuning; the BLEU score is calculated with sacreBLEU (Post, 2018). Significant improvements are observed on all dev sets after fine-tuning, which shows that even after pretraining on large-scale parallel data, fine-tuning on in-domain data is of great importance.
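For reproducibility, the sacreBLEU computation can be invoked as in the short sketch below; the sentences are placeholders.

```python
import sacrebleu

# corpus_bleu expects a list of hypotheses and a list of reference streams.
hyps = ["Das ist ein Test."]
refs = [["Das ist ein Test."]]
print(sacrebleu.corpus_bleu(hyps, refs).score)
```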

5 Conclusion

In this paper, we report our work on the IWSLT 2021 offline speech translation task. Our system is structured as a cascade, including an off-the-shelf speaker diarization tool and Transformer-based ASR and MT models. Multi-source training for the ASR model proves significantly useful for fully leveraging data sampled from different distributions, corresponding to different domains and styles in our setting; it not only enlarges the training data but also steers the model to decode in a specific style. The large-scale WMT news corpora enable the MT model to learn domain-invariant translation knowledge during pretraining, and the in-domain TED corpus brings additional improvements through fine-tuning. In future work, we will investigate end-to-end models that can benefit from both ASR and ST datasets.

References

  • Anastasopoulos et al. (2021) Antonios Anastasopoulos, Ondřej Bojar, Jacob Bremerman, Roldano Cattoni, Maha Elbayad, Marcello Federico, Xutai Ma, Satoshi Nakamura, Matteo Negri, Jan Niehues, Juan Pino, Elizabeth Salesky, Sebastian Stüker, Katsuhito Sudoh, Marco Turchi, Alexander Waibel, Changhan Wang, and Matthew Wiesner. 2021. Findings of the IWSLT 2021 evaluation campaign. In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), pages 1–29, Bangkok, Thailand (online). Association for Computational Linguistics.
  • Barrault et al. (2019) Loïc Barrault, Ondrej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation, WMT 2019, Florence, Italy, August 1-2, 2019 - Volume 2: Shared Task Papers, Day 1, pages 1–61. Association for Computational Linguistics.
  • Bentivogli et al. (2021) Luisa Bentivogli, Mauro Cettolo, Marco Gaido, Alina Karakanta, Alberto Martinelli, Matteo Negri, and Marco Turchi. 2021. Cascade versus direct speech translation: Do the differences still make a difference? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 2873–2887. Association for Computational Linguistics.
  • Cattoni et al. (2021) Roldano Cattoni, Mattia Antonino Di Gangi, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2021. MuST-C: A multilingual corpus for end-to-end speech translation. Comput. Speech Lang., 66:101155.
  • Ha et al. (2016) Thanh-Le Ha, Jan Niehues, and Alexander H. Waibel. 2016. Toward multilingual neural machine translation with universal encoder and decoder. CoRR, abs/1611.04798.
  • Johnson et al. (2017) Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda B. Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Trans. Assoc. Comput. Linguistics, 5:339–351.
  • Koehn et al. (2007) Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In ACL 2007, Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, June 23-30, 2007, Prague, Czech Republic. The Association for Computational Linguistics.
  • Kudo and Richardson (2018) Taku Kudo and John Richardson. 2018. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018: System Demonstrations, Brussels, Belgium, October 31 - November 4, 2018, pages 66–71. Association for Computational Linguistics.
  • Meignier and Merlin (2010) Sylvain Meignier and Teva Merlin. 2010. LIUM SpkDiarization: An open source toolkit for diarization. In CMU SPUD Workshop, Dallas, United States.
  • Ng et al. (2019) Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, and Sergey Edunov. 2019. Facebook FAIR's WMT19 news translation task submission. In Proceedings of the Fourth Conference on Machine Translation, WMT 2019, Florence, Italy, August 1-2, 2019 - Volume 2: Shared Task Papers, Day 1, pages 314–319. Association for Computational Linguistics.
  • Ott et al. (2019) Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Demonstrations, pages 48–53. Association for Computational Linguistics.
  • Panayotov et al. (2015) Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. Librispeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2015, South Brisbane, Queensland, Australia, April 19-24, 2015, pages 5206–5210. IEEE.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA, pages 311–318. ACL.
  • Park et al. (2019) Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le. 2019. Specaugment: A simple data augmentation method for automatic speech recognition. In Interspeech 2019, 20th Annual Conference of the International Speech Communication Association, Graz, Austria, 15-19 September 2019, pages 2613–2617. ISCA.
  • Post (2018) Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, WMT 2018, Belgium, Brussels, October 31 - November 1, 2018, pages 186–191. Association for Computational Linguistics.
  • Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers. The Association for Computer Linguistics.
  • Snover et al. (2006) Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers, pages 223–231.
  • Stanojevic and Sima’an (2014) Milos Stanojevic and Khalil Sima’an. 2014. Fitting sentence level translation evaluation with many dense features. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 202–206. ACL.
  • Synnaeve et al. (2019) Gabriel Synnaeve, Qiantong Xu, Jacob Kahn, Edouard Grave, Tatiana Likhomanenko, Vineel Pratap, Anuroop Sriram, Vitaliy Liptchinsky, and Ronan Collobert. 2019. End-to-end ASR: from supervised to semi-supervised learning with modern architectures. CoRR, abs/1911.08460.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008.
  • Wang et al. (2020) Changhan Wang, Juan Miguel Pino, Anne Wu, and Jiatao Gu. 2020. CoVoST: A diverse multilingual speech-to-text translation corpus. In Proceedings of The 12th Language Resources and Evaluation Conference, LREC 2020, Marseille, France, May 11-16, 2020, pages 4197–4203. European Language Resources Association.
  • Wang et al. (2016) Weiyue Wang, Jan-Thorsten Peter, Hendrik Rosendahl, and Hermann Ney. 2016. Character: Translation edit rate on character level. In Proceedings of the First Conference on Machine Translation, WMT 2016, colocated with ACL 2016, August 11-12, Berlin, Germany, pages 505–510. The Association for Computer Linguistics.