The Authenticity Gap in Human Evaluation

Ethayarajh, Kawin; Jurafsky, Dan

(Submitted on 24 May 2022 (v1), last revised 3 Nov 2022 (this version, v2))

Abstract: Human ratings are the gold standard in NLG evaluation. The standard protocol is to collect ratings of generated text, average across annotators, and rank NLG systems by their average scores. However, little consideration has been given as to whether this approach faithfully captures human preferences. Analyzing this standard protocol through the lens of utility theory in economics, we identify the implicit assumptions it makes about annotators. These assumptions are often violated in practice, in which case annotator ratings cease to reflect their preferences. The most egregious violations come from using Likert scales, which provably reverse the direction of the true preference in certain cases. We suggest improvements to the standard protocol to make it more theoretically sound, but even in its improved form, it cannot be used to evaluate open-ended tasks like story generation. For the latter, we propose a new human evaluation protocol called $\textit{system-level probabilistic assessment}$ (SPA). When human evaluation of stories is done with SPA, we can recover the ordering of GPT-3 models by size, with statistically significant results. However, when human evaluation is done with the standard protocol, less than half of the expected preferences can be recovered (e.g., there is no significant difference between $\texttt{curie}$ and $\texttt{davinci}$, despite using a highly powered test).
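The standard protocol described in the abstract (collect ratings, average across annotators, rank systems by mean score) can be sketched as follows. This is a minimal illustration and not the authors' code or their utility-theoretic argument: the function names, system names, and the toy 1-5 Likert ratings are all hypothetical. The toy data only shows that a ranking by mean rating need not agree with what most annotators individually prefer.

```python
from statistics import mean

def rank_by_mean_rating(ratings):
    """Rank systems by their mean rating, best first.

    ratings maps a system name to a list of per-annotator scores.
    """
    means = {system: mean(scores) for system, scores in ratings.items()}
    return sorted(means, key=means.get, reverse=True)

def pairwise_preference_counts(ratings, a, b):
    """Count annotators who rated a above b, b above a, or tied.

    Assumes both lists are aligned, i.e. index i is the same annotator.
    """
    a_wins = sum(x > y for x, y in zip(ratings[a], ratings[b]))
    b_wins = sum(y > x for x, y in zip(ratings[a], ratings[b]))
    ties = len(ratings[a]) - a_wins - b_wins
    return a_wins, b_wins, ties

# Hypothetical toy data: five annotators, 1-5 Likert ratings (not from the paper).
toy_ratings = {
    "system_A": [5, 5, 1, 1, 1],
    "system_B": [2, 2, 2, 2, 2],
}

ranking = rank_by_mean_rating(toy_ratings)
prefs = pairwise_preference_counts(toy_ratings, "system_A", "system_B")
print(ranking)  # system_A comes first by mean rating
print(prefs)    # yet more annotators rated system_B higher
```

In this toy case the mean ranks system_A first while three of the five annotators rated system_B higher, which is one simple way averaged ratings and individual preferences can diverge.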

Comments:	EMNLP 2022
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2205.11930 [cs.CL]
	(or arXiv:2205.11930v2 [cs.CL] for this version)

Submission history

From: Kawin Ethayarajh
[v1] Tue, 24 May 2022 09:51:27 GMT (1094kb,D)
[v2] Thu, 3 Nov 2022 03:04:39 GMT (1144kb,D)
