Computer Science > Computation and Language
Title: Text normalization for endangered languages: the case of Ligurian
(Submitted on 16 Jun 2022 (this version), latest version 22 Dec 2023 (v2))
Abstract: Text normalization is a crucial technology for low-resource languages which lack rigid spelling conventions. Low-resource text normalization has so far relied upon hand-crafted rules, which are perceived to be more data efficient than neural methods.
In this paper we examine the case of text normalization for Ligurian, an endangered Romance language. We collect 4,394 Ligurian sentences paired with their normalized versions, as well as the first monolingual corpus for Ligurian. We show that, in spite of the small amounts of data available, a compact transformer-based model can be trained to achieve very low error rates by the use of backtranslation and appropriate tokenization. Our datasets are released to the public.
Submission history
From: Jean Maillard
[v1] Thu, 16 Jun 2022 00:37:55 GMT (206kb,D)
[v2] Fri, 22 Dec 2023 06:33:04 GMT (310kb,D)