MLS: A Large-Scale Multilingual Dataset for Speech Research

Pratap, Vineel; Xu, Qiantong; Sriram, Anuroop; Synnaeve, Gabriel; Collobert, Ronan

doi:10.21437/Interspeech.2020-2826

(Submitted on 7 Dec 2020 (v1), last revised 19 Dec 2020 (this version, v2))

Abstract: This paper introduces the Multilingual LibriSpeech (MLS) dataset, a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and consists of 8 languages, including about 44.5K hours of English and a total of about 6K hours for the other languages. Additionally, we provide Language Models (LM) and baseline Automatic Speech Recognition (ASR) models for all the languages in our dataset. We believe such a large transcribed dataset will open new avenues in ASR and Text-To-Speech (TTS) research. The dataset will be made freely available to anyone at this http URL

Subjects:	Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Journal reference:	Interspeech 2020
DOI:	10.21437/Interspeech.2020-2826
Cite as:	arXiv:2012.03411 [eess.AS]
	(or arXiv:2012.03411v2 [eess.AS] for this version)

Submission history

From: Vineel Pratap
[v1] Mon, 7 Dec 2020 01:53:45 GMT (606kb,D)
[v2] Sat, 19 Dec 2020 09:18:21 GMT (622kb,D)
