Computer Science > Social and Information Networks
Title: Mega-COV: A Billion-Scale Dataset of 100+ Languages for COVID-19
(Submitted on 2 May 2020 (v1), last revised 5 Feb 2021 (this version, v4))
Abstract: We describe Mega-COV, a billion-scale dataset from Twitter for studying COVID-19. The dataset is diverse (covers 268 countries), longitudinal (goes as far back as 2007), multilingual (comes in 100+ languages), and has a significant number of location-tagged tweets (~169M tweets). We release tweet IDs from the dataset. We also develop and release two powerful models, one for identifying whether or not a tweet is related to the pandemic (best F1=97%) and another for detecting misinformation about COVID-19 (best F1=92%). A human annotation study reveals the utility of our models on a subset of Mega-COV. Our data and models can be useful for studying a wide host of phenomena related to the pandemic. Mega-COV and our models are publicly available.
Submission history
From: Abdelrahim Elmadany
[v1] Sat, 2 May 2020 10:23:27 GMT (7846kb,D)
[v2] Fri, 7 Aug 2020 23:57:50 GMT (9734kb,D)
[v3] Mon, 25 Jan 2021 19:15:42 GMT (13784kb,D)
[v4] Fri, 5 Feb 2021 22:19:06 GMT (9567kb,D)