Annotated Speech Corpus for Low Resource Indian Languages: Awadhi, Bhojpuri, Braj and Magahi

Kumar, Ritesh; Singh, Siddharth; Ratan, Shyam; Raj, Mohit; Sinha, Sonal; Lahiri, Bornini; Seshadri, Vivek; Bali, Kalika; Ojha, Atul Kr.

(Submitted on 26 Jun 2022)

Abstract: In this paper we discuss ongoing work on the development of a speech corpus for four low-resource Indo-Aryan languages (Awadhi, Bhojpuri, Braj and Magahi) using the field methods of linguistic data collection. The corpus currently totals approximately 18 hours (approximately 4-5 hours per language) and is transcribed and annotated with grammatical information such as part-of-speech tags, morphological features and Universal Dependencies relations. We discuss our methodology for data collection in these languages, most of which was carried out in the middle of the COVID-19 pandemic, with one of the aims being to generate some additional income for low-income groups speaking these languages. We also discuss the results of baseline experiments for automatic speech recognition systems in these languages.

Comments:	Speech for Social Good Workshop, 2022, Interspeech 2022
Subjects:	Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2206.12931 [cs.CL]
	(or arXiv:2206.12931v1 [cs.CL] for this version)

Submission history

From: Ritesh Kumar
[v1] Sun, 26 Jun 2022 17:28:38 GMT (679kb)
