Title: Mega-COV: A Billion-Scale Dataset of 100+ Languages for COVID-19

Abstract: We describe Mega-COV, a billion-scale dataset from Twitter for studying COVID-19. The dataset is diverse (covers 268 countries), longitudinal (goes back as far as 2007), multilingual (comes in 100+ languages), and has a significant number of location-tagged tweets (~169M tweets). We release tweet IDs from the dataset. We also develop and release two powerful models, one for identifying whether or not a tweet is related to the pandemic (best F1=97%) and another for detecting misinformation about COVID-19 (best F1=92%). A human annotation study reveals the utility of our models on a subset of Mega-COV. Our data and models can be useful for studying a wide range of phenomena related to the pandemic. Mega-COV and our models are publicly available.
Subjects: Social and Information Networks (cs.SI); Computation and Language (cs.CL)
Cite as: arXiv:2005.06012 [cs.SI]
  (or arXiv:2005.06012v4 [cs.SI] for this version)

Submission history

From: Abdelrahim Elmadany
[v1] Sat, 2 May 2020 10:23:27 GMT (7846kb,D)
[v2] Fri, 7 Aug 2020 23:57:50 GMT (9734kb,D)
[v3] Mon, 25 Jan 2021 19:15:42 GMT (13784kb,D)
[v4] Fri, 5 Feb 2021 22:19:06 GMT (9567kb,D)