Morphological Tagging and Lemmatization of Albanian: A Manually Annotated Corpus and Neural Models

Kote, Nelda; Biba, Marenglen; Kanerva, Jenna; Rönnqvist, Samuel; Ginter, Filip

(Submitted on 2 Dec 2019)

Abstract: In this paper, we present the first publicly available part-of-speech and morphologically tagged corpus for the Albanian language, as well as a neural morphological tagger and lemmatizer trained on it. There is currently a lack of available NLP resources for Albanian, and its complex grammar and morphology present challenges to their development. We have created an Albanian part-of-speech corpus based on the Universal Dependencies schema for morphological annotation, containing about 118,000 tokens of naturally occurring text collected from different text sources, plus an additional 67,000 tokens of artificially created simple sentences used only in training. On this corpus, we subsequently train and evaluate segmentation, morphological tagging and lemmatization models, using the Turku Neural Parser Pipeline. On the held-out evaluation set, the model achieves 92.74% accuracy on part-of-speech tagging, 85.31% on morphological tagging, and 89.95% on lemmatization. The manually annotated corpus and the trained models are available under an open license.
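
The three accuracy figures in the abstract are token-level scores over separate annotation layers. As a rough illustration only, the sketch below shows how such per-layer accuracies might be computed from two CoNLL-U files, the format used by Universal Dependencies. It assumes gold tokenization, so gold and predicted tokens align one to one. The abstract does not say which evaluation script was used, and a standard evaluation may align tokens and define the morphological score differently. The helper names `read_conllu` and `accuracy` and the four-token Albanian sentence are hypothetical, written for this example and not drawn from the corpus.

```python
# Column positions kept from each CoNLL-U token line.
FORM, LEMMA, UPOS, FEATS = 0, 1, 2, 3


def read_conllu(text):
    """Parse CoNLL-U text into a flat list of (form, lemma, upos, feats) tuples."""
    rows = []
    for line in text.splitlines():
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank sentence separators and comment lines
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        # CoNLL-U columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        rows.append((cols[1], cols[2], cols[3], cols[5]))
    return rows


def accuracy(gold, pred, field):
    """Fraction of tokens whose `field` value matches between gold and pred."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted token counts differ")
    if not gold:
        return 0.0
    correct = sum(g[field] == p[field] for g, p in zip(gold, pred))
    return correct / len(gold)


def make_line(idx, form, lemma, upos, feats):
    """Build one 10-column CoNLL-U line, leaving unused columns as '_'."""
    return "\t".join([str(idx), form, lemma, upos, "_", feats, "_", "_", "_", "_"])


# Toy data (hypothetical): the prediction gets the lemma and features of "libra" wrong.
gold_text = "\n".join([
    "# text = Ai lexon libra .",
    make_line(1, "Ai", "ai", "PRON", "Case=Nom|Number=Sing"),
    make_line(2, "lexon", "lexoj", "VERB", "Number=Sing|Person=3"),
    make_line(3, "libra", "libër", "NOUN", "Number=Plur"),
    make_line(4, ".", ".", "PUNCT", "_"),
])
pred_text = "\n".join([
    make_line(1, "Ai", "ai", "PRON", "Case=Nom|Number=Sing"),
    make_line(2, "lexon", "lexoj", "VERB", "Number=Sing|Person=3"),
    make_line(3, "libra", "libra", "NOUN", "Number=Sing"),
    make_line(4, ".", ".", "PUNCT", "_"),
])

gold = read_conllu(gold_text)
pred = read_conllu(pred_text)

for name, field in [("UPOS", UPOS), ("FEATS", FEATS), ("LEMMA", LEMMA)]:
    print(f"{name}: {accuracy(gold, pred, field):.2%}")
```

Here the morphological score requires an exact match of the whole feature string, which is one common convention; whether the reported 85.31% follows that convention is not stated in the abstract.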

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:1912.00991 [cs.CL]
	(or arXiv:1912.00991v1 [cs.CL] for this version)

Submission history

From: Samuel Rönnqvist
[v1] Mon, 2 Dec 2019 18:50:37 GMT (138kb)
