Title: Automatic Creation of Text Corpora for Low-Resource Languages from the Internet: The Case of Swiss German

Abstract: This paper presents SwissCrawl, the largest Swiss German text corpus to date. Composed of more than half a million sentences, it was generated using a customized web scraping tool that could be applied to other low-resource languages as well. The approach demonstrates how freely available web pages can be used to construct comprehensive text corpora, which are of fundamental importance for natural language processing. In an experimental evaluation, we show that using the new corpus leads to significant improvements on the task of language modeling. To capture new content, the scraping tool will run continuously, so the corpus keeps growing over time.
Comments: Submitted to LREC 2020
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:1912.00159 [cs.CL]
  (or arXiv:1912.00159v1 [cs.CL] for this version)

Submission history

From: Lucy Linder
[v1] Sat, 30 Nov 2019 08:42:25 GMT (219kb,D)
[v2] Sat, 21 Mar 2020 18:18:42 GMT (222kb,D)
[v3] Tue, 16 Jun 2020 14:52:55 GMT (222kb,D)
