NSINA: A News Corpus for Sinhala

Hettiarachchi, Hansi; Premasiri, Damith; Uyangodage, Lasitha; Ranasinghe, Tharindu

Full-text links:

Download:

Current browse context:

cs.CL

< prev | next >

new | recent | 2403

Computer Science > Computation and Language

Title: NSINA: A News Corpus for Sinhala

Authors: Hansi Hettiarachchi, Damith Premasiri, Lasitha Uyangodage, Tharindu Ranasinghe

(Submitted on 25 Mar 2024)

Abstract: The introduction of large language models (LLMs) has advanced natural language processing (NLP), but their effectiveness is largely dependent on pre-training resources. This is especially evident in low-resource languages, such as Sinhala, which face two primary challenges: the lack of substantial training data and limited benchmarking datasets. In response, this study introduces NSINA, a comprehensive news corpus of over 500,000 articles from popular Sinhala news websites, along with three NLP tasks: news media identification, news category prediction, and news headline generation. The release of NSINA aims to provide a solution to challenges in adapting LLMs to Sinhala, offering valuable resources and benchmarks for improving NLP in the Sinhala language. NSINA is the largest news corpus for Sinhala, available up to date.

Comments:	Accepted to LREC-COLING 2024 (The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation)
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2403.16571 [cs.CL]
	(or arXiv:2403.16571v1 [cs.CL] for this version)

Submission history

From: Tharindu Ranasinghe Dr [view email]
[v1] Mon, 25 Mar 2024 09:36:51 GMT (926kb,D)

Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?)

Link back to: arXiv, form interface, contact.

> cs > arXiv:2403.16571

Download:

Current browse context:

Change to browse by:

References & Citations

DBLP - CS Bibliography

Bookmark

Computer Science > Computation and Language

Title: NSINA: A News Corpus for Sinhala

Submission history