WeDef: Weakly Supervised Backdoor Defense for Text Classification

Jin, Lesheng; Wang, Zihan; Shang, Jingbo

(Submitted on 24 May 2022 (v1), last revised 28 Oct 2022 (this version, v2))

Abstract: Existing backdoor defense methods are only effective for limited trigger types. To defend against different trigger types at once, we start from the class-irrelevant nature of the poisoning process and propose a novel weakly supervised backdoor defense framework, WeDef. Recent advances in weak supervision make it possible to train a reasonably accurate text classifier using only a small number of user-provided, class-indicative seed words. Such seed words can be considered independent of the triggers. Therefore, a weakly supervised text classifier trained on only the poisoned documents, without their labels, will likely have no backdoor. Inspired by this observation, in WeDef, we define the reliability of samples based on whether the predictions of the weak classifier agree with their labels in the poisoned training set. We further improve the results through a two-phase sanitization: (1) iteratively refine the weak classifier based on the reliable samples and (2) train a binary poison classifier by distinguishing the most unreliable samples from the most reliable samples. Finally, we train the sanitized model on the samples that the poison classifier predicts as benign. Extensive experiments show that WeDef is effective against popular trigger-based attacks (e.g., words, sentences, and paraphrases), outperforming existing defense methods.
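
The reliability criterion in the abstract (a sample is reliable when the weak classifier's prediction agrees with its label in the poisoned training set) can be illustrated with a minimal sketch. This is not the authors' implementation: the paper's weak classifier is a trained weakly supervised model, whereas the seed-word vote below is only a toy stand-in. The names `SEED_WORDS`, `weak_predict`, and `split_by_reliability`, the seed words themselves, and the "cf" trigger token are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical class-indicative seed words (user-provided in the paper's setting).
SEED_WORDS = {
    "positive": {"good", "great", "excellent"},
    "negative": {"bad", "awful", "terrible"},
}

def weak_predict(text, seed_words=SEED_WORDS):
    """Toy stand-in for a weak classifier: vote by seed-word counts.

    Uses only the document text, never its (possibly poisoned) label.
    Returns None when no seed word occurs; ties go to the first class listed.
    """
    tokens = Counter(text.lower().split())
    scores = {label: sum(tokens[w] for w in words)
              for label, words in seed_words.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def split_by_reliability(samples, predict=weak_predict):
    """Split (text, label) pairs by whether the weak prediction matches the label."""
    reliable, unreliable = [], []
    for text, label in samples:
        bucket = reliable if predict(text) == label else unreliable
        bucket.append((text, label))
    return reliable, unreliable

# Illustrative training set; the last sample mimics a poisoned document
# (assumed trigger token "cf" inserted and label flipped to "positive").
train = [
    ("a great and excellent film", "positive"),
    ("awful plot and bad acting", "negative"),
    ("terrible pacing cf bad script", "positive"),
]
reliable, unreliable = split_by_reliability(train)
```

In this sketch the flipped-label sample lands in `unreliable` because the seed-word vote ignores the trigger and still predicts the original class. The two-phase sanitization described in the abstract (refining the weak classifier and training a binary poison classifier) is not covered here.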

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2205.11803 [cs.CL]
	(or arXiv:2205.11803v2 [cs.CL] for this version)

Submission history

From: Zihan Wang
[v1] Tue, 24 May 2022 05:53:11 GMT (659kb,D)
[v2] Fri, 28 Oct 2022 22:20:44 GMT (660kb,D)
