Computer Science > Computation and Language
Title: NaijaSenti: A Nigerian Twitter Sentiment Corpus for Multilingual Sentiment Analysis
(Submitted on 20 Jan 2022 (v1), last revised 18 Jun 2022 (this version, v3))
Abstract: Sentiment analysis is one of the most widely studied applications in NLP, but most work focuses on languages with large amounts of data. We introduce the first large-scale human-annotated Twitter sentiment dataset for the four most widely spoken languages in Nigeria (Hausa, Igbo, Nigerian-Pidgin, and Yorùbá) consisting of around 30,000 annotated tweets per language (and 14,000 for Nigerian-Pidgin), including a significant fraction of code-mixed tweets. We propose text collection, filtering, processing and labeling methods that enable us to create datasets for these low-resource languages. We evaluate a range of pre-trained models and transfer strategies on the dataset. We find that language-specific models and language-adaptive fine-tuning generally perform best. We release the datasets, trained models, sentiment lexicons, and code to incentivize research on sentiment analysis in under-represented languages.
Submission history
From: Shamsuddeen Hassan Muhammad
[v1] Thu, 20 Jan 2022 16:28:06 GMT (2679kb,D)
[v2] Fri, 28 Jan 2022 15:11:23 GMT (2679kb,D)
[v3] Sat, 18 Jun 2022 09:48:10 GMT (2679kb,D)