Lisan: Yemeni, Iraqi, Libyan, and Sudanese Arabic Dialect Corpora with Morphological Annotations

Jarrar, Mustafa; Zaraket, Fadi A; Hammouda, Tymaa; Alavi, Daanish Masood; Waahlisch, Martin

(Submitted on 13 Dec 2022 (v1), last revised 17 Dec 2022 (this version, v2))

Abstract: This article presents Lisan, a set of morphologically annotated corpora for the Yemeni, Sudanese, Iraqi, and Libyan Arabic dialects. Together the corpora contain around 1.2 million tokens, collected from several social media platforms. The Yemeni corpus (~1.05M tokens) was collected automatically from Twitter. The corpora of the other three dialects (~50K tokens each) were collected manually from Facebook and YouTube posts and comments.
Thirty-five (35) annotators, all native speakers of the target dialects, carried out the annotation. They segmented every word in the four corpora into prefixes, stems, and suffixes, and labeled each word with morphological features such as part of speech, lemma, and an English gloss. We developed an Arabic Dialect Annotation Toolkit (ADAT) for this purpose, both to assist the annotators and to ensure compatibility with the SAMA and Curras tagsets. The annotators were trained on a set of guidelines and on how to use ADAT. The tool is open source, and the four corpora are also available online.
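
The annotation scheme described in the abstract (a word split into prefixes, a stem, and suffixes, plus a part of speech, a lemma, and an English gloss) can be pictured as a small record type. The Python sketch below is only an illustration under stated assumptions: the class names, field names, tag strings, and the transliterated example word are hypothetical and are not taken from ADAT or from the Lisan corpora, whose actual file format is not described here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segment:
    """One morphological segment with its tag (names are hypothetical)."""
    surface: str  # the segment's form, here in Latin transliteration
    pos: str      # a part-of-speech-style tag; values below are placeholders


@dataclass
class AnnotatedToken:
    """A word with its segmentation, lemma, and English gloss."""
    word: str
    stem: Segment
    lemma: str
    gloss: str
    prefixes: List[Segment] = field(default_factory=list)
    suffixes: List[Segment] = field(default_factory=list)

    def segments(self) -> List[Segment]:
        # Prefixes first, then the stem, then suffixes, in surface order.
        return [*self.prefixes, self.stem, *self.suffixes]

    def segmented_form(self) -> str:
        # Join segment surfaces with '+' to show the segmentation.
        return "+".join(s.surface for s in self.segments())

    def tag_sequence(self) -> str:
        # Join the per-segment tags in the same order.
        return "+".join(s.pos for s in self.segments())


# Illustrative example only; not an entry from the corpora.
token = AnnotatedToken(
    word="wakitAbhum",
    prefixes=[Segment("wa", "CONJ")],
    stem=Segment("kitAb", "NOUN"),
    suffixes=[Segment("hum", "POSS_PRON")],
    lemma="kitAb",
    gloss="and their book",
)

print(token.segmented_form())
print(token.tag_sequence())
```

Keeping the stem as a required field and the affixes as optional lists reflects the assumption that every word has exactly one stem and any number of prefixes and suffixes, including none.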

Subjects:	Computation and Language (cs.CL); Digital Libraries (cs.DL)
Cite as:	arXiv:2212.06468 [cs.CL]
	(or arXiv:2212.06468v2 [cs.CL] for this version)

Submission history

From: Fadi Zaraket
[v1] Tue, 13 Dec 2022 10:37:10 GMT (710kb,D)
[v2] Sat, 17 Dec 2022 12:37:29 GMT (710kb,D)