Sinhala Language Corpora and Stopwords from a Decade of Sri Lankan Facebook

Wijeratne, Yudhanjaya; de Silva, Nisansa

(Submitted on 15 Jul 2020)

Abstract: This paper presents two colloquial Sinhala language corpora from the language efforts of the Data, Analysis and Policy team of LIRNEasia, as well as a list of algorithmically derived stopwords. The larger corpus spans 2010 to 2020 and contains between 28,825,820 and 29,549,672 words of multilingual text posted by 533 Sri Lankan Facebook pages covering politics, media, celebrities, and other categories; the smaller corpus amounts to 5,402,76 words of Sinhala-only text extracted from the larger. Both corpora carry markers for date of creation, page of origin, and content type.

Comments:	10 pages; Github repo of data linked in summary
Subjects:	Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Cite as:	arXiv:2007.07884 [cs.CL]
	(or arXiv:2007.07884v1 [cs.CL] for this version)

Submission history

From: Yudhanjaya Wijeratne
[v1] Wed, 15 Jul 2020 17:57:56 GMT (171kb)
