GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio

Chen, Guoguo; Chai, Shuzhou; Wang, Guanbo; Du, Jiayu; Zhang, Wei-Qiang; Weng, Chao; Su, Dan; Povey, Daniel; Trmal, Jan; Zhang, Junbo; Jin, Mingjie; Khudanpur, Sanjeev; Watanabe, Shinji; Zhao, Shuaijiang; Zou, Wei; Li, Xiangang; Yao, Xuchen; Wang, Yongqing; Wang, Yujun; You, Zhao; Yan, Zhiyong

(Submitted on 13 Jun 2021)

Abstract: This paper introduces GigaSpeech, an evolving, multi-domain English speech recognition corpus with 10,000 hours of high-quality labeled audio suitable for supervised training, and 40,000 hours of total audio suitable for semi-supervised and unsupervised training. Around 40,000 hours of transcribed audio is first collected from audiobooks, podcasts and YouTube, covering both read and spontaneous speaking styles and a variety of topics, such as arts, science and sports. A new forced alignment and segmentation pipeline is proposed to create sentence segments suitable for speech recognition training, and to filter out segments with low-quality transcription. For system training, GigaSpeech provides five subsets of different sizes: 10h, 250h, 1000h, 2500h, and 10000h. For our 10,000-hour XL training subset, we cap the word error rate at 4% during the filtering/validation stage, and for all our other smaller training subsets, we cap it at 0%. The DEV and TEST evaluation sets, on the other hand, are re-processed by professional human transcribers to ensure high transcription quality. Baseline systems are provided for popular speech recognition toolkits, namely Athena, ESPnet, Kaldi and Pika.
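The per-subset word error rate caps mentioned in the abstract can be illustrated with a small sketch. This is not the paper's forced alignment and segmentation pipeline; it only assumes that each candidate segment comes with a reference transcript and a hypothesis from some recognizer, and keeps the segment when its WER does not exceed the cap for the target subset. The names `word_error_rate`, `filter_segments`, and `WER_CAPS` are hypothetical, and keying the caps by subset size is an illustrative choice.

```python
from typing import Dict, List, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        # Convention chosen for this sketch: an empty reference has zero
        # error only when the hypothesis is empty as well.
        return 0.0 if not hyp else float("inf")
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                 # deletion
                curr[j - 1] + 1,             # insertion
                prev[j - 1] + substitution,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Caps as stated in the abstract: 4% for the 10,000-hour subset, 0% for the
# smaller ones. The dictionary keys are an assumption made for illustration.
WER_CAPS: Dict[str, float] = {
    "10h": 0.0,
    "250h": 0.0,
    "1000h": 0.0,
    "2500h": 0.0,
    "10000h": 0.04,
}


def filter_segments(
    segments: List[Tuple[str, str]], cap: float
) -> List[Tuple[str, str]]:
    """Keep (reference, hypothesis) pairs whose WER does not exceed cap."""
    return [
        (ref, hyp) for ref, hyp in segments if word_error_rate(ref, hyp) <= cap
    ]
```

Under these assumptions, a segment whose hypothesis differs from the reference in one word out of four has a WER of 25% and is rejected under both caps, while an exact match passes even the 0% cap.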

Subjects:	Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2106.06909 [cs.SD]
	(or arXiv:2106.06909v1 [cs.SD] for this version)

Submission history

From: Guoguo Chen [view email]
[v1] Sun, 13 Jun 2021 04:09:16 GMT (2459kb,D)
