Title: Text normalization for low-resource languages: the case of Ligurian

Abstract: Text normalization is a crucial technology for low-resource languages that lack rigid spelling conventions or have undergone multiple spelling reforms. Low-resource text normalization has so far relied on hand-crafted rules, which are perceived to be more data-efficient than neural methods. In this paper we examine the case of text normalization for Ligurian, an endangered Romance language. We collect 4,394 Ligurian sentences paired with their normalized versions, as well as the first open-source monolingual corpus for Ligurian. We show that, despite the small amount of data available, a compact transformer-based model can be trained to achieve very low error rates through backtranslation and appropriate tokenization.
Subjects: Computation and Language (cs.CL)
Journal reference: In Proceedings of the Sixth Workshop on the Use of Computational Methods in the Study of Endangered Languages, pp. 98-103 (2023)
Cite as: arXiv:2206.07861 [cs.CL]
  (or arXiv:2206.07861v2 [cs.CL] for this version)

Submission history

From: Jean Maillard
[v1] Thu, 16 Jun 2022 00:37:55 GMT (206kb,D)
[v2] Fri, 22 Dec 2023 06:33:04 GMT (310kb,D)
