How not to Lie with a Benchmark: Rearranging NLP Leaderboards

Tatiana Shavrina, Valentin Malykh

(Submitted on 2 Dec 2021)

Abstract: Comparison with human performance is an essential requirement for a benchmark to be a reliable measurement of model capabilities. Nevertheless, the methods used for model comparison may have a fundamental flaw: the arithmetic mean of separate metrics is applied across all tasks, regardless of their complexity or the sizes of their test and training sets.
In this paper, we examine the overall scoring methods of popular NLP benchmarks and re-rank the models by the geometric and harmonic means (appropriate for averaging rates) of their reported results. We analyze several popular benchmarks, including GLUE, SuperGLUE, XGLUE, and XTREME. The analysis shows that, for example, human level on SuperGLUE has still not been reached, and that there is still room for improvement for current models.
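To illustrate why the choice of mean can change a leaderboard, here is a minimal sketch in Python. The system names and per-task scores are hypothetical and are not taken from GLUE, SuperGLUE, XGLUE, XTREME, or the paper; the `rank` helper is likewise an assumption made for this example, not the authors' code.

```python
from statistics import fmean, geometric_mean, harmonic_mean

# Hypothetical per-task scores on a 0-100 scale (illustrative only,
# not real leaderboard results).
scores = {
    "system_a": [95.0, 95.0, 50.0],  # strong on two tasks, weak on one
    "system_b": [80.0, 80.0, 78.0],  # consistent across all tasks
}


def rank(task_scores, mean_fn):
    """Return system names sorted from best to worst under mean_fn."""
    return sorted(
        task_scores,
        key=lambda name: mean_fn(task_scores[name]),
        reverse=True,
    )


arithmetic_rank = rank(scores, fmean)
geometric_rank = rank(scores, geometric_mean)
harmonic_rank = rank(scores, harmonic_mean)

for label, ranking in [
    ("arithmetic", arithmetic_rank),
    ("geometric", geometric_rank),
    ("harmonic", harmonic_rank),
]:
    print(f"{label:>10}: {ranking}")
```

With these made-up numbers, the arithmetic mean places system_a first, while the geometric and harmonic means both place system_b first, because they penalize a single weak task more heavily than the arithmetic mean does.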

Comments:	Accepted to ICBINB Workshop, NeurIPS 2021
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
MSC classes:	68-06, 68T50, 68T01
ACM classes:	G.3; I.2.7
Cite as:	arXiv:2112.01342 [cs.CL]
	(or arXiv:2112.01342v1 [cs.CL] for this version)

Submission history

From: Tatiana Shavrina
[v1] Thu, 2 Dec 2021 15:40:52 GMT (240kb)
