Title: An Empirical Study on the Overlapping Problem of Open-Domain Dialogue Datasets

Abstract: Open-domain dialogue systems aim to converse with humans through text, and dialogue research has heavily relied on benchmark datasets. In this work, we observe the overlapping problem in DailyDialog and OpenSubtitles, two popular open-domain dialogue benchmark datasets. Our systematic analysis then shows that such overlapping can be exploited to obtain fake state-of-the-art performance. Finally, we address this issue by cleaning these datasets and setting up a proper data processing procedure for future research.
Comments: Accepted by LREC 2022
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
ACM classes: I.2.7; I.2.6
Cite as: arXiv:2201.06219 [cs.CL]
  (or arXiv:2201.06219v2 [cs.CL] for this version)

Submission history

From: Yuqiao Wen
[v1] Mon, 17 Jan 2022 05:12:13 GMT (317kb,D)
[v2] Sun, 8 May 2022 23:24:50 GMT (372kb,D)
