A large-scale Twitter dataset for drug safety applications mined from publicly existing resources

Tekumalla, Ramya; Banda, Juan M.

(Submitted on 31 Mar 2020)

Abstract: As deep learning models grow in popularity for natural language processing (NLP) tasks, the field of pharmacovigilance, and specifically the identification of adverse drug reactions (ADRs), has an inherent need for large-scale social media datasets aimed at such tasks. Most researchers spend large amounts of time crawling Twitter or buy expensive pre-curated datasets and then have humans annotate them manually; these approaches do not scale well as more and more data keeps flowing into Twitter. In this work we repurpose a publicly available archived dataset of more than 9.4 billion tweets with the objective of creating a very large dataset of drug usage-related tweets. Using existing manually curated datasets from the literature, we then validate our filtered tweets for relevance using machine learning methods, with the end result of a publicly available dataset of 1,181,993 tweets. We provide all code and a detailed procedure for extracting this dataset, along with the selected tweet IDs, for researchers to use.
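The filtering step described in the abstract can be pictured with a minimal sketch. This is not the authors' released code: the function names, the `id`/`text` record layout, and the two-entry drug list are assumptions made for illustration, and the machine learning validation step is not covered here.

```python
import re
from typing import Dict, Iterable, Iterator


def build_term_pattern(drug_terms: Iterable[str]) -> re.Pattern:
    """Compile one case-insensitive, whole-word pattern from a term list."""
    # Longest terms first, so a multi-word term wins over a shorter prefix.
    escaped = sorted((re.escape(t) for t in drug_terms), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def filter_drug_tweets(tweets: Iterable[Dict], pattern: re.Pattern) -> Iterator[Dict]:
    """Yield only the tweets whose text mentions at least one drug term."""
    for tweet in tweets:
        if pattern.search(tweet["text"]):
            yield tweet


# Hypothetical inputs: a tiny term list and a few made-up tweet records.
DRUG_TERMS = ["ibuprofen", "aspirin"]
SAMPLE_TWEETS = [
    {"id": 1, "text": "Took ibuprofen and my headache is gone"},
    {"id": 2, "text": "Nice weather today"},
    {"id": 3, "text": "ASPIRIN upset my stomach"},
    {"id": 4, "text": "aspiring writer here"},
]

kept_ids = [t["id"] for t in filter_drug_tweets(SAMPLE_TWEETS, build_term_pattern(DRUG_TERMS))]
```

Whole-word matching keeps "aspiring" from counting as a mention of "aspirin". A filter this simple still passes irrelevant tweets, which is why the abstract describes a separate validation step against manually curated datasets.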

Comments:	8 tables, 2 figures, 7 pages, accepted after peer review as a workshop paper in ACM Conference on Health, Inference, and Learning (CHIL) 2020
Subjects:	Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
Cite as:	arXiv:2003.13900 [cs.IR]
	(or arXiv:2003.13900v1 [cs.IR] for this version)

Submission history

From: Juan Banda
[v1] Tue, 31 Mar 2020 01:30:24 GMT (394kb)
