Title: Short Text Language Identification for Under Resourced Languages

Abstract: The paper presents a hierarchical naive Bayesian and lexicon-based classifier for short text language identification (LID) that is useful for under-resourced languages. The algorithm is evaluated on short pieces of text for the 11 official South African languages, some of which are closely related. The algorithm is compared to recent approaches using test sets from previous work on South African languages as well as the datasets of the Discriminating between Similar Languages (DSL) shared tasks. Remaining research opportunities and pressing concerns in evaluating and comparing LID approaches are also discussed.
Comments: Presented at NeurIPS 2019 Workshop on Machine Learning for the Developing World
Subjects: Computation and Language (cs.CL)
MSC classes: 68T50
Cite as: arXiv:1911.07555 [cs.CL]
  (or arXiv:1911.07555v2 [cs.CL] for this version)

Submission history

From: Bernardt Duvenhage
[v1] Mon, 18 Nov 2019 11:34:38 GMT (15kb)
[v2] Fri, 22 Nov 2019 04:53:48 GMT (15kb)
