The Lab vs The Crowd: An Investigation into Data Quality for Neural Dialogue Models

Lopes, José; Garcia, Francisco J. Chiyah; Hastie, Helen

Title: The Lab vs The Crowd: An Investigation into Data Quality for Neural Dialogue Models

Authors: José Lopes, Francisco J. Chiyah Garcia, Helen Hastie

(Submitted on 7 Dec 2020)

Abstract: Challenges around collecting and processing quality data have hampered progress in data-driven dialogue models. Previous approaches are moving away from costly, resource-intensive lab settings, where collection is slow but the data is deemed to be of high quality. The advent of crowd-sourcing platforms, such as Amazon Mechanical Turk, has provided researchers with a cost-effective and rapid alternative for collecting data. However, collecting fluid, natural spoken or textual interaction can be challenging, particularly between two crowd-sourced workers. In this study, we compare the performance of dialogue models trained on data for the same interaction task but collected in two different settings: in the lab vs. crowd-sourced. We find that fewer lab dialogues are needed to reach similar accuracy: less than half as much lab data as crowd-sourced data. We discuss the advantages and disadvantages of each data collection method.

Comments:	Accepted at Human in the Loop Dialogue Systems Workshop @NeurIPS 2020
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2012.03855 [cs.CL]
	(or arXiv:2012.03855v1 [cs.CL] for this version)

Submission history

From: José Lopes
[v1] Mon, 7 Dec 2020 17:02:00 GMT (3944kb,D)
