BEAMetrics: A Benchmark for Language Generation Evaluation Evaluation

Scialom, Thomas; Hill, Felix

(Submitted on 18 Oct 2021)

Abstract: Natural language processing (NLP) systems are increasingly trained to generate open-ended text rather than to classify between responses. This makes research on evaluation metrics for generated language -- functions that score system output given the context and/or human reference responses -- of critical importance. However, different metrics have different strengths and biases, and reflect human intuitions better on some tasks than others. There is currently no simple, unified way to compare, analyse or evaluate metrics across a representative set of tasks. Here, we describe the Benchmark to Evaluate Automatic Metrics (BEAMetrics), a resource to make research into new metrics itself easier to evaluate. BEAMetrics users can quickly compare existing and new metrics with human judgements across a diverse set of tasks, quality dimensions (fluency vs. coherence vs. informativeness, etc.), and languages. As generation experts might predict, BEAMetrics reveals stark task-dependent differences between existing metrics, and consistently poor performance on tasks with complex answer spaces or high reliance on general knowledge. While this analysis highlights a critical issue facing current research practice, BEAMetrics also contributes to its resolution by facilitating research into better metrics -- particularly those that can account for the complex interaction between context and general knowledge inherent to many modern NLP applications. BEAMetrics is available under the MIT License: this https URL
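
To make "comparing a metric with human judgements" concrete, the following is a minimal sketch and not the BEAMetrics code or API. It assumes a toy unigram-overlap F1 as the automatic metric, Pearson correlation as the agreement measure, and a handful of made-up (output, reference, human rating) examples; the names `unigram_f1`, `pearson`, and `examples` are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import sqrt


def unigram_f1(candidate: str, reference: str) -> float:
    """Toy metric (an assumption for illustration): F1 over shared lowercase unigrams."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Made-up data: (system output, human reference, human rating on a 1-5 scale).
examples = [
    ("the cat sat on the mat", "the cat sat on the mat", 5.0),
    ("a cat is on the mat", "the cat sat on the mat", 4.0),
    ("the dog barked loudly", "the cat sat on the mat", 1.0),
    ("cat mat", "the cat sat on the mat", 2.0),
]

metric_scores = [unigram_f1(out, ref) for out, ref, _ in examples]
human_scores = [rating for _, _, rating in examples]
correlation = pearson(metric_scores, human_scores)
print(f"Pearson r between metric and human ratings: {correlation:.3f}")
```

Repeating this computation per task and per quality dimension would give the kind of task-dependent comparison the abstract describes; which correlation statistic and which datasets to use are choices left open here.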

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2110.09147 [cs.CL]
	(or arXiv:2110.09147v1 [cs.CL] for this version)

Submission history

From: Thomas Scialom
[v1] Mon, 18 Oct 2021 10:03:19 GMT (1148kb,D)
