Computer Science > Social and Information Networks
Title: Multi-class Twitter Data Categorization and Geocoding with a Novel Computing Framework
(Submitted on 8 May 2019 (this version), latest version 28 Aug 2019 (v3))
Abstract: Transportation data analysis is becoming a major area of computing application. The operation and management of transportation systems have been transforming with advances in computing technology. This study presents one such advance in transportation data analysis: a novel computing framework built on a Labelled Latent Dirichlet Allocation (L-LDA)-incorporated Support Vector Machine (SVM) classifier, together with a supporting computing strategy. The framework uses publicly available Twitter data, which represents 1% of all Twitter data, to determine transportation-related events (i.e., incidents, congestion, special events, construction, and other events) and provide reliable information to travelers. The analytical approach consists of analyzing tweets using text classification and geocoding locations based on string similarity. A case study conducted for New York City and its surrounding areas demonstrates the feasibility of the approach. In total, almost 700,010 tweets collected over one week are analyzed to extract relevant transportation-related information. For a large geographic area such as New York City and its surrounding areas, parallel computation achieves a 30-fold speedup over sequential processing in analyzing transportation-related tweets. The SVM classifier achieves more than 85% accuracy in identifying transportation-related tweets from structured data. To further categorize the transportation-related tweets into sub-classes (incident, congestion, construction, special events, and other events), three supervised classifiers are used: L-LDA, SVM, and L-LDA-incorporated SVM. The analytical framework, which uses the L-LDA-incorporated SVM, can classify roadway transportation-related data from Twitter with over 98.3% accuracy, which is significantly higher than the accuracies achieved by standalone L-LDA and SVM.
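The abstract mentions geocoding locations based on string similarity but does not say which similarity measure or place-name source is used. The following is a minimal sketch of the general idea, not the authors' method: it assumes a small hypothetical gazetteer (`GAZETTEER`, with approximate, illustrative coordinates), the standard-library `difflib.SequenceMatcher` ratio as the similarity measure, and an arbitrary threshold of 0.85. The function names `similarity` and `geocode` are likewise invented for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical gazetteer: place name -> (latitude, longitude).
# Coordinates are approximate and for illustration only.
GAZETTEER = {
    "Brooklyn Bridge": (40.7061, -73.9969),
    "Lincoln Tunnel": (40.7627, -74.0093),
    "George Washington Bridge": (40.8517, -73.9527),
}


def similarity(a, b):
    """Case-insensitive similarity in [0, 1] using difflib's ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def geocode(tweet, gazetteer=GAZETTEER, threshold=0.85):
    """Return (name, (lat, lon), score) for the best fuzzy match, or None.

    Each gazetteer name is compared against every window of the tweet
    that has the same number of words as the name. The threshold is an
    assumed value, not one taken from the paper.
    """
    words = re.findall(r"[A-Za-z0-9']+", tweet)
    best_name, best_score = None, 0.0
    for name in gazetteer:
        n = len(name.split())
        for i in range(len(words) - n + 1):
            window = " ".join(words[i:i + n])
            score = similarity(window, name)
            if score > best_score:
                best_name, best_score = name, score
    if best_name is not None and best_score >= threshold:
        return best_name, gazetteer[best_name], best_score
    return None


if __name__ == "__main__":
    # A misspelled place name still resolves to a gazetteer entry.
    print(geocode("Accident on the brooklyn brige, expect delays"))
    # A tweet with no recognizable place name yields None.
    print(geocode("Nice weather today in the park"))
```

Matching on word windows rather than on the whole tweet keeps the score from being diluted by unrelated text, at the cost of one comparison per window per gazetteer entry. Because every tweet is processed independently, this kind of workload is also straightforward to parallelize.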
Submission history
From: Sakib Khan
[v1] Wed, 8 May 2019 05:08:59 GMT (1397kb)
[v2] Thu, 18 Jul 2019 14:10:58 GMT (1273kb)
[v3] Wed, 28 Aug 2019 19:04:13 GMT (1339kb)