Enhanced Offensive Language Detection Through Data Augmentation

Liu, Ruibo; Xu, Guangxuan; Vosoughi, Soroush

(Submitted on 5 Dec 2020)

Abstract: Detecting offensive language on social media is an important task. The ICWSM-2020 Data Challenge Task 2 aims to identify offensive content using a crowd-sourced dataset of 100k labelled tweets. The dataset, however, suffers from class imbalance: certain labels are extremely rare compared with the others (e.g., the hateful class makes up only 5% of the data). In this work, we present Dager (Data Augmenter), a generation-based data augmentation method that improves classification performance on imbalanced and low-resource data such as the offensive language dataset. Dager extracts the lexical features of a given class and uses them to guide a conditional generator built on GPT-2. The generated text can then be added to the training set as augmentation data. We show that applying Dager can increase the F1 score on the data challenge by 11% when 1% of the whole dataset is used for training (with BERT as the classifier); moreover, the generated data preserves the original labels very well. We test Dager on four different classifiers (BERT, CNN, Bi-LSTM with attention, and Transformer) and observe improvement in detection across all of them, indicating that our method is effective and classifier-agnostic.
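
The augmentation loop described in the abstract (extract class-specific lexical features, generate new text conditioned on them, append it to the training set) can be illustrated with a minimal sketch. This is not the authors' implementation. The scoring rule (a smoothed log-frequency ratio), the function names `class_keywords`, `augment`, and `template_generator`, the label `"normal"`, and the toy sentences are all assumptions made for illustration. The generator is a plain placeholder standing in for the GPT-2-based conditional generator, which the abstract does not specify in detail.

```python
import math
from collections import Counter


def class_keywords(texts, labels, target, top_k=5, smoothing=1.0):
    """Rank tokens by how much more frequent they are in `target` than elsewhere.

    Assumption: a smoothed log-frequency ratio stands in for the paper's
    lexical feature extraction, whose exact form the abstract does not give.
    """
    in_counts, out_counts = Counter(), Counter()
    for text, label in zip(texts, labels):
        tokens = text.lower().split()
        (in_counts if label == target else out_counts).update(tokens)

    vocab = set(in_counts) | set(out_counts)
    in_total = sum(in_counts.values()) + smoothing * len(vocab)
    out_total = sum(out_counts.values()) + smoothing * len(vocab)

    scores = {
        tok: math.log((in_counts[tok] + smoothing) / in_total)
        - math.log((out_counts[tok] + smoothing) / out_total)
        for tok in vocab
    }
    # Highest score first; ties broken alphabetically so results are stable.
    ranked = sorted(scores, key=lambda tok: (-scores[tok], tok))
    return ranked[:top_k]


def template_generator(keywords):
    """Hypothetical placeholder for a conditional language model.

    A real system would prompt or steer a generator with these keywords;
    here they are simply joined into a string.
    """
    return " ".join(keywords)


def augment(texts, labels, target, generate, n_new, top_k=5):
    """Return copies of the training data with `n_new` generated examples
    appended, each labelled with the rare class `target`."""
    keywords = class_keywords(texts, labels, target, top_k=top_k)
    new_texts = [generate(keywords) for _ in range(n_new)]
    return texts + new_texts, labels + [target] * n_new


# Toy, made-up data: two rare-class and two majority-class examples.
texts = [
    "you are awful awful",
    "nice day today",
    "what a nice game",
    "awful awful person",
]
labels = ["hateful", "normal", "normal", "hateful"]

aug_texts, aug_labels = augment(
    texts, labels, "hateful", template_generator, n_new=2, top_k=1
)
```

Passing the generator as a callable keeps the feature extraction and the bookkeeping independent of any particular language model, so the placeholder could be swapped for a real conditional generator without changing `augment`.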

Comments:	In ICWSM 2020 Data Challenge. Online
Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2012.02954 [cs.CL]
	(or arXiv:2012.02954v1 [cs.CL] for this version)

Submission history

From: Soroush Vosoughi
[v1] Sat, 5 Dec 2020 05:45:16 GMT (81kb,D)
