
Title: A large-scale Twitter dataset for drug safety applications mined from publicly existing resources

Abstract: As deep learning models grow in popularity for natural language processing (NLP) tasks, the field of pharmacovigilance, and more specifically the identification of Adverse Drug Reactions (ADRs), has an inherent need for large-scale social-media datasets aimed at such tasks. Most researchers spend large amounts of time crawling Twitter or buy expensive pre-curated datasets and then annotate them manually; these approaches do not scale well as more and more data keeps flowing into Twitter. In this work we re-purpose a publicly available archived dataset of more than 9.4 billion tweets with the objective of creating a very large dataset of drug usage-related tweets. Using existing manually curated datasets from the literature, we then validate our filtered tweets for relevance using machine learning methods, with the end result of a publicly available dataset of 1,181,993 tweets. We provide all code and a detailed procedure for extracting this dataset, along with the selected tweet IDs, for researchers to use.
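
The filtering step summarized in the abstract could look roughly like the sketch below. It is only an illustration under stated assumptions, not the authors' released code: it assumes the archive is stored as one JSON object per line with a `text` field, and the term list `DRUG_TERMS` and the function names `mentions_drug` and `filter_tweets` are hypothetical. The machine learning relevance check mentioned in the abstract is not shown.

```python
import json
import re
from typing import AbstractSet, Iterable, Iterator

# Hypothetical drug-term list for illustration; the paper's actual
# dictionary is not reproduced here.
DRUG_TERMS = frozenset({"ibuprofen", "metformin", "prozac"})

# Split text into lowercase alphanumeric tokens.
_TOKEN = re.compile(r"[a-z0-9]+")


def mentions_drug(text: str, terms: AbstractSet[str] = DRUG_TERMS) -> bool:
    """Return True if any whole token of the text is a known drug term."""
    return any(tok in terms for tok in _TOKEN.findall(text.lower()))


def filter_tweets(lines: Iterable[str]) -> Iterator[dict]:
    """Yield parsed tweets whose text mentions at least one drug term.

    Assumes one JSON object per line with a "text" field; blank or
    malformed lines are skipped rather than raising.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(tweet, dict):
            continue
        if mentions_drug(str(tweet.get("text") or "")):
            yield tweet
```

Matching on whole tokens rather than substrings avoids false hits inside longer words, at the cost of missing misspellings and inflected forms; a real pipeline would likely need a larger dictionary that covers such variants.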
Comments: 8 tables, 2 figures, 7 pages; accepted after peer review as a workshop paper at the ACM Conference on Health, Inference, and Learning (CHIL) 2020
Subjects: Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
Cite as: arXiv:2003.13900 [cs.IR]
  (or arXiv:2003.13900v1 [cs.IR] for this version)

Submission history

From: Juan Banda
[v1] Tue, 31 Mar 2020 01:30:24 GMT (394kb)
