
Title: gaBERT -- an Irish Language Model

Abstract: The BERT family of neural language models has become highly popular due to its ability to provide sequences of text with rich context-sensitive token encodings that generalise well to many Natural Language Processing tasks. Over 120 monolingual BERT models covering over 50 languages have been released, as well as a multilingual model trained on 104 languages. We introduce gaBERT, a monolingual BERT model for the Irish language. We compare gaBERT to multilingual BERT and show that gaBERT provides better representations for a downstream parsing task. We also show how different filtering criteria, vocabulary size and the choice of subword tokenisation model affect downstream performance. We release gaBERT and related code to the community.
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2107.12930 [cs.CL]
  (or arXiv:2107.12930v2 [cs.CL] for this version)

Submission history

From: James Barry
[v1] Tue, 27 Jul 2021 16:38:53 GMT (713kb,D)
[v2] Wed, 28 Jul 2021 08:20:27 GMT (712kb,D)
