
Speech Wikimedia: A 77 Language Multilingual Speech Dataset

Rafael Mosquera Gómez    Julian Eusse    Juan Ciro    Daniel Galvez    Ryan Hileman    Kurt Bollacker    David Kanter
Abstract

The Speech Wikimedia Dataset is a publicly available compilation of audio with transcriptions extracted from Wikimedia Commons. It includes 1,780 hours (195 GB) of CC-BY-SA licensed transcribed speech from a diverse set of scenarios and speakers, in 77 different languages. Each audio file has one or more transcriptions, possibly in different languages, making this dataset suitable for training speech recognition, speech translation, and machine translation models.


1 Introduction

Speech research has witnessed remarkable advancements in recent years, largely driven by the availability of vast amounts of data. Tasks such as automatic speech recognition (ASR), speaker recognition, and speech translation have reached robust results even in the presence of background noise, jargon, and different accents. Nevertheless, one of the fundamental challenges in speech research is the scarcity of multilingual datasets.

In this paper, we introduce the Speech Wikimedia Dataset, a supervised dataset consisting of 1,780 hours of audio files, each with one or more transcripts. The last point bears repeating: much of the dataset (approximately 25%) has multiple associated transcripts, each in its own language, which is rare in datasets sourced from the internet. The dataset is specifically curated from Wikimedia Commons to address some of the key challenges in the speech recognition, machine translation, and speech translation spaces, particularly the need for diverse multilingual data and appropriate licensing for academic and commercial usage.

The paper is organized as follows. In section 2, we describe the dataset itself and training tasks that it can support; in section 3, we describe related work; in section 4, we describe some limitations of the dataset today; in section 5, we conclude with some future work that this dataset can enable.

2 Dataset Description

To construct the Speech Wikimedia dataset, we downloaded raw video and audio from Wikimedia Commons (https://commons.wikimedia.org), which allows only content that is "free" (Wikimedia Commons, 2023). For our purposes, this means data under a CC-BY or CC-BY-SA license, or otherwise in the public domain. After downloading, the data was converted to 16 kHz mono-channel FLAC using ffmpeg. The data is uploaded to Hugging Face at https://huggingface.co/datasets/MLCommons/speech-wikimedia.
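
For illustration, the conversion step can be reproduced roughly as follows; the helper name, directory layout, and exact ffmpeg flags below are a sketch rather than the exact pipeline we used.

```python
# Minimal sketch (not the exact pipeline): convert a downloaded
# Wikimedia Commons media file to 16 kHz, single-channel FLAC with ffmpeg.
import subprocess
from pathlib import Path

def to_flac_16k(src: Path, dst_dir: Path) -> Path:
    """Convert `src` to 16 kHz mono FLAC and return the output path."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    dst = dst_dir / (src.stem + ".flac")
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src),
         "-vn",           # drop any video stream, keep audio only
         "-ac", "1",      # downmix to a single channel
         "-ar", "16000",  # resample to 16 kHz
         "-c:a", "flac",  # encode as FLAC
         str(dst)],
        check=True,
    )
    return dst

# Example: to_flac_16k(Path("Elephants_Dream.ogv"), Path("flac/"))
```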

We give statistics for three possible tasks that this dataset can be used for: speech recognition, speech translation, and machine translation.

2.1 Licensing Information

Given that all data is public domain, CC-BY-licensed, or CC-BY-SA licensed, we are licensing the dataset as CC-BY-SA. Following the requirements imposed by the CC-BY and CC-BY-SA licenses of our sources, accreditation is provided in the linked credits.json file.

2.2 Audio with Subtitles in the Same Language for Speech Recognition Task

To determine the amount of data available for the ASR task, we kept only those audio-transcription pairs whose languages coincided. Since the filenames of Wikimedia Commons transcripts encode the language of the text they contain, we extracted the transcript language directly from the filename. For example, the file “Elephants_Dream.ogv.en.srt” is in English, as indicated by the “.en.” substring.
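
A minimal sketch of this step is shown below; the regular expression is an approximation of the "<media name>.<language code>.srt" naming convention and may not cover every filename on Wikimedia Commons.

```python
# Minimal sketch: pull the language code out of a subtitle filename,
# assuming the "<media name>.<language code>.srt" pattern described above.
import re
from typing import Optional

def transcript_language(filename: str) -> Optional[str]:
    """Return the language code embedded in a subtitle filename, if any."""
    match = re.search(r"\.([a-z]{2,3}(?:-[A-Za-z]+)?)\.srt$", filename)
    return match.group(1) if match else None

print(transcript_language("Elephants_Dream.ogv.en.srt"))  # -> en
```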

Because we did not initially know the audio language of each file, we used Whisper’s (Radford et al., 2022) language detection pipeline. A total of 77 different languages were detected, with English, Dutch, German, Russian, and Spanish being the most common. 69.07% of the 1,780-hour dataset comprises audio-transcription pairs in the same language. We present the number of transcribed hours for each language in Table 1 in the appendix.
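
For reference, per-file language detection with the openai-whisper package can be sketched as follows; the model size and the use of only the first 30 seconds of each file are assumptions of this sketch rather than details of our exact setup.

```python
# Minimal sketch of per-file language detection with openai-whisper.
# The "base" model size is an assumption; a different size may be used.
import whisper

model = whisper.load_model("base")

def detect_audio_language(path: str) -> str:
    """Return the most likely language code for the first 30 s of audio."""
    audio = whisper.pad_or_trim(whisper.load_audio(path))
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    _, probs = model.detect_language(mel)
    return max(probs, key=probs.get)

# Example: detect_audio_language("Elephants_Dream.flac")  # e.g. "en"
```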

2.3 Audio with Subtitles in Different Languages for Speech Translation Task

For speech translation, we focused on audio-transcript pairs whose languages differ, which corresponds to the remaining 31% of the 1,780 hours. After filtering out audio files and transcriptions with unknown languages, we were left with a total of 628.8 hours of audio with transcripts in a different language. We present the hours of audio for the most common language pairs in Table 2 in the appendix.

2.4 Transcript Language Pairs (Bitexts) for Machine Translation Task

While the speech translation task relies on audio from one language paired with a transcription in another language, machine translation focuses on pure text translation. We find paired texts by enumerating all pairs of transcripts associated with a single audio file. This is particularly interesting because approximately 10.93% of the audio files have transcripts in at least three languages.
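
The pairing step can be sketched as follows; the in-memory `transcripts` mapping is a hypothetical stand-in for however the transcripts are actually indexed.

```python
# Minimal sketch: enumerate transcript language pairs (bitexts) per audio
# file. The `transcripts` mapping below is a hypothetical illustration.
from itertools import combinations

transcripts = {
    "Elephants_Dream.flac": {
        "en": "Elephants_Dream.ogv.en.srt",
        "es": "Elephants_Dream.ogv.es.srt",
        "fr": "Elephants_Dream.ogv.fr.srt",
    },
}

bitexts = [
    (audio, src, tgt)
    for audio, by_lang in transcripts.items()
    for src, tgt in combinations(sorted(by_lang), 2)
]
# Three pairs for this example file: (en, es), (en, fr), (es, fr).
```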

In Table 3 in the appendix, we present the total hours, source and target token counts, and number of bitexts for the 20 most common language pairs in the dataset.

2.5 Topic Distribution and Audio Content

We were also interested in determining which topics were covered across the dataset. To analyze this, we ran a zero-shot classifier (Maiya, 2020) with labels spanning different topics and recorded the hours of audio for each topic. Results are shown in Table 4 in the appendix. Popular topics were current events, history, and general non-fiction references.
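
As an illustration, an equivalent zero-shot topic classifier can be built with the Hugging Face transformers pipeline (the analysis itself used ktrain (Maiya, 2020), which wraps a similar NLI-based model); the model choice and the abbreviated label list below are assumptions of this sketch.

```python
# Illustrative sketch of zero-shot topic labeling with the Hugging Face
# `transformers` pipeline; the paper's analysis used ktrain (Maiya, 2020).
# The model and the abbreviated label set below are assumptions.
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

TOPICS = ["current events", "history", "health", "general reference",
          "society", "science", "technology", "other"]

def top_topic(transcript_text: str) -> str:
    """Return the highest-scoring topic label for a transcript."""
    result = classifier(transcript_text, candidate_labels=TOPICS)
    return result["labels"][0]  # labels are sorted by score, best first
```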

Based on listening to a sample of files, we also found that many of the recordings are public speeches, music, or clearly pronounced single words, the latter probably intended for pronunciation dictionaries such as Wiktionary.

3 Related Work

In this section we provide an overview of previous, similar datasets.

Mozilla Common Voice (Ardila et al., 2020) is a CC0-licensed 17,690-hour public domain corpus of single speaker read speech in 108 languages created by volunteers. In contrast, Speech Wikimedia has much more diverse audio sources.

Multilingual Librispeech (Pratap et al., 2020) is a CC-BY dataset of 50,500 hours of transcribed read speech in eight languages; 6,000 of its hours are non-English. Meanwhile, our dataset contains 77 languages, although the majority of our data is also in English.

VoxPopuli (Wang et al., 2021) is a CC0-licensed dataset containing an unsupervised set of 400,000 hours in 23 languages, and 1,500 hours of transcribed audio in 15 languages. Like our dataset, it also contains a subset suitable for a speech translation task.

Multilingual Spoken Words Corpus (Mazumder et al., 2021) is a CC-BY-licensed 6,000-hour dataset containing more than 340,000 keywords in 50 different languages. It is intended for training keyword-spotting models rather than speech recognition models.

OpenSubtitles (Lison & Tiedemann, 2016) is a machine translation dataset containing 1,782 language pairings extracted from movie subtitles in 62 languages. Given the data source, it is not licensed for commercial usage. In contrast, the Speech Wikimedia Dataset has 929 language pairings from 77 languages.

4 Limitations

As mentioned above, the raw data is publicly available on Hugging Face; however, it has not yet been processed with forced alignment (audio to transcript) or bitext word alignment (transcript to transcript), and therefore cannot be used immediately to train models.

We removed all video data when converting to FLAC. In future work, this data could be helpful for a multimodal task.

While collecting this dataset, we realized that Wikimedia Commons also hosts a collection of audio data without any transcripts. However, we have not explored this subset and have not made it available at this time.

Given the small size of the dataset, we are not providing a training-test split.

5 Conclusions

We introduce the Speech Wikimedia Dataset, a collection of audio files with transcriptions in multiple languages extracted from Wikimedia Commons. The dataset encompasses over 1,780 hours of transcribed speech in 77 languages. The CC-BY-SA license enables commercial usage. To our knowledge, this is the first non-read multilingual speech dataset other than VoxPopuli that allows commercial usage.

References

  • Wikimedia Commons (2023) Commons: Licensing. https://commons.wikimedia.org/wiki/Commons:Licensing. Accessed: 2023-05-23.
  • Ardila et al. (2020) Ardila, R., Branson, M., Davis, K., Henretty, M., Kohler, M., Meyer, J., Morais, R., Saunders, L., Tyers, F. M., and Weber, G. Common voice: A massively-multilingual speech corpus, 2020.
  • Lison & Tiedemann (2016) Lison, P. and Tiedemann, J. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pp.  923–929, Portorož, Slovenia, May 2016. European Language Resources Association (ELRA). URL https://aclanthology.org/L16-1147.
  • Maiya (2020) Maiya, A. S. ktrain: A low-code library for augmented machine learning. CoRR, abs/2004.10703, 2020. URL https://arxiv.org/abs/2004.10703.
  • Mazumder et al. (2021) Mazumder, M., Chitlangia, S., Banbury, C., Kang, Y., Ciro, J. M., Achorn, K., Galvez, D., Sabini, M., Mattson, P., Kanter, D., Diamos, G., Warden, P., Meyer, J., and Reddi, V. J. Multilingual spoken words corpus. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. URL https://openreview.net/forum?id=c20jiJ5K2H.
  • Pratap et al. (2020) Pratap, V., Xu, Q., Sriram, A., Synnaeve, G., and Collobert, R. MLS: A large-scale multilingual dataset for speech research. In Interspeech 2020. ISCA, oct 2020. doi: 10.21437/interspeech.2020-2826. URL https://doi.org/10.21437%2Finterspeech.2020-2826.
  • Radford et al. (2022) Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., and Sutskever, I. Robust speech recognition via large-scale weak supervision, 2022.
  • Wang et al. (2021) Wang, C., Rivière, M., Lee, A., Wu, A., Talnikar, C., Haziza, D., Williamson, M., Pino, J. M., and Dupoux, E. Voxpopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. CoRR, abs/2101.00390, 2021. URL https://arxiv.org/abs/2101.00390.

Appendix A Tables

Table 1: Automatic Speech Recognition Task

Language    Hours of audio
English (en) 1488.765773
Dutch (nl) 22.167223
German (de) 12.658670
French (fr) 7.163889
Russian (ru) 6.985941
Spanish (es) 6.184720
Latin (la) 3.066669
Polish (pl) 3.045028
Japanese (ja) 2.216300
Bengali (bn) 2.126192
Swedish (sv) 1.468487
Chinese (zh) 1.456599
Italian (it) 1.419221
Portuguese (pt) 1.344584
Welsh (cy) 1.141955
Basque (eu) 1.008435
Hindi (hi) 0.886795
Arabic (ar) 0.572991
Ukrainian (uk) 0.441770
Slovenian (sl) 0.377644
Korean (ko) 0.367545
Hebrew (he) 0.238240
Indonesian (id) 0.207363
Thai (th) 0.177196
Catalan (ca) 0.161531
Greek (el) 0.160628
Danish (da) 0.150981
Persian (fa) 0.132622
Vietnamese (vi) 0.131922
Marathi (mr) 0.124219
Punjabi (pa) 0.090774
Malayalam (ml) 0.078354
Telugu (te) 0.065369
Kannada (kn) 0.033602
Hungarian (hu) 0.030055
Estonian (et) 0.029325
Turkish (tr) 0.024743
Finnish (fi) 0.022719
Czech (cs) 0.021120
Tagalog (tl) 0.016138
Romanian (ro) 0.015280
Slovak (sk) 0.000766
Tamil (ta) 0.000364
Table 2: Speech Translation Task

Audio Language    Transcript Language    Duration (hours)
English Spanish 67.115705
English Arabic 43.398845
English French 38.163062
English Portuguese 30.952778
English Dutch 24.165356
English German 23.678866
English Italian 23.442334
English Russian 15.557022
Dutch English 14.409074
English Polish 12.865772
Latin English 11.722308
English Chinese 11.182589
Hindi English 10.256298
English Turkish 9.471801
English Japanese 8.782186
Welsh English 8.761795
English Vietnamese 6.731008
Russian English 6.037366
Dutch Russian 5.438943
Table 3: Transcript Language Pairs Statistics

Language Pair Total Hours Source Language Token Count Target Language Token Count Bitexts
English-Spanish 135.989042 481391.0 486965.0 629
English-French 85.782796 262040.0 255998.0 343
English-Portuguese 57.887887 200853.0 194911.0 197
English-Russian 55.501208 149706.0 119638.0 348
German-English 55.449766 149499.0 168156.0 394
Spanish-Portuguese 54.185978 200486.0 191356.0 166
Spanish-French 51.878961 178583.0 182886.0 213
English-Dutch 49.582888 182302.0 166567.0 164
English-Italian 47.008579 131800.0 125312.0 200
French-Portuguese 38.802198 147000.0 138013.0 146
Arabic-English 38.239120 106115.0 136589.0 182
German-Spanish 36.046692 110171.0 127857.0 211
Arabic-Spanish 34.548516 109102.0 136234.0 139
Arabic-French 34.121227 110543.0 138088.0 134
German-French 33.843628 94528.0 111353.0 204
French-Italian 33.791085 117284.0 113286.0 150
Spanish-Italian 33.368969 117633.0 109450.0 162
Arabic-Portuguese 29.675284 96408.0 113835.0 98
German-Italian 28.917809 85169.0 96215.0 154
French-Russian 27.784403 80372.0 63862.0 155
Table 4: Distribution of topics and their durations

Topic    Duration (hours)
current events 641.809422
other 406.496021
history 154.644719
health 151.203017
general reference 114.263664
society 95.286515
political 46.760335
technology 46.079005
number 45.406569
business 40.940404
science 34.944243
culture 27.520957
languages 26.272240
city 24.718620
location 15.258170
software 8.970613
geography 8.372475
animal 7.417614
religion 7.382679
philosophy 6.863642
art 5.678405
entertainment 5.076250
mathematics 2.186313
crypto 1.731447
gaming 1.252531
engineering 0.154807