Mega-COV: A Billion-Scale Dataset of 100+ Languages for COVID-19

Abdul-Mageed, Muhammad; Elmadany, AbdelRahim; Nagoudi, El Moatez Billah; Pabbi, Dinesh; Verma, Kunal; Lin, Rannie

(Submitted on 2 May 2020 (v1), last revised 5 Feb 2021 (this version, v4))

Abstract: We describe Mega-COV, a billion-scale dataset from Twitter for studying COVID-19. The dataset is diverse (covers 268 countries), longitudinal (goes as far back as 2007), multilingual (comes in 100+ languages), and has a significant number of location-tagged tweets (~169M tweets). We release tweet IDs from the dataset. We also develop and release two powerful models, one for identifying whether or not a tweet is related to the pandemic (best F1=97%) and another for detecting misinformation about COVID-19 (best F1=92%). A human annotation study reveals the utility of our models on a subset of Mega-COV. Our data and models can be useful for studying a wide range of phenomena related to the pandemic. Mega-COV and our models are publicly available.

Subjects:	Social and Information Networks (cs.SI); Computation and Language (cs.CL)
Cite as:	arXiv:2005.06012 [cs.SI]
	(or arXiv:2005.06012v4 [cs.SI] for this version)

Submission history

From: Abdelrahim Elmadany
[v1] Sat, 2 May 2020 10:23:27 GMT (7846kb,D)
[v2] Fri, 7 Aug 2020 23:57:50 GMT (9734kb,D)
[v3] Mon, 25 Jan 2021 19:15:42 GMT (13784kb,D)
[v4] Fri, 5 Feb 2021 22:19:06 GMT (9567kb,D)
