Multi-News+: Cost-efficient Dataset Cleansing via LLM-based Data Annotation

Choi, Juhwan; Yun, Jungmin; Jin, Kyohoon; Kim, YoungBin


(Submitted on 15 Apr 2024)

Abstract: Dataset quality is crucial to the performance and reliability of downstream task models. However, datasets often contain noisy data inadvertently included during construction. Numerous attempts have been made to correct this issue with human annotators, but hiring and managing them is expensive and time-consuming. As an alternative, recent studies explore the use of large language models (LLMs) for data annotation.
We present a case study that extends LLM-based data annotation to improving the quality of existing datasets through a cleansing strategy. Specifically, we use approaches such as chain-of-thought (CoT) and majority voting to imitate human annotation and identify unrelated documents in the Multi-News dataset, which is widely used for multi-document summarization. Applying the proposed cleansing method yields an enhanced dataset, Multi-News+. By employing LLMs for data cleansing, we demonstrate an efficient and effective approach to improving dataset quality without relying on expensive human annotation.
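
The majority-voting idea mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Annotator` signature, the `keyword_annotator` stand-in for an LLM call, the default of five votes, and the tie rule are all assumptions made for illustration only.

```python
from collections import Counter
from typing import Callable, List, Sequence

# Hypothetical annotator signature (an assumption, not from the paper):
# (summary, document, run_index) -> True if the document is judged relevant.
Annotator = Callable[[str, str, int], bool]


def majority_vote(votes: Sequence[bool]) -> bool:
    """Return the label chosen by most votes; a tie keeps the document (assumed policy)."""
    counts = Counter(votes)
    return counts[True] >= counts[False]


def filter_documents(
    summary: str,
    documents: Sequence[str],
    annotate: Annotator,
    n_votes: int = 5,
) -> List[str]:
    """Keep only the documents that a majority of annotation runs mark as relevant."""
    kept = []
    for doc in documents:
        # Each run stands in for one independent LLM annotation of the same pair.
        votes = [annotate(summary, doc, i) for i in range(n_votes)]
        if majority_vote(votes):
            kept.append(doc)
    return kept


def keyword_annotator(summary: str, document: str, run_index: int) -> bool:
    """Toy stand-in for an LLM call: relevant if the texts share at least one word."""
    return bool(set(summary.lower().split()) & set(document.lower().split()))


if __name__ == "__main__":
    docs = ["A storm hits the coast.", "Subscribe to our newsletter"]
    print(filter_documents("storm hits coast", docs, keyword_annotator, n_votes=3))
```

Using an odd number of votes avoids ties altogether; in a real pipeline the toy annotator would be replaced by a prompted model call whose reasoning (for example, a chain-of-thought response) is parsed into a single relevant/unrelated label.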

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2404.09682 [cs.CL]
	(or arXiv:2404.09682v1 [cs.CL] for this version)

Submission history

From: Juhwan Choi
[v1] Mon, 15 Apr 2024 11:36:10 GMT (9660kb,D)
