Multilingual Culture-Independent Word Analogy Datasets

Ulčar, Matej; Vaik, Kristiina; Lindström, Jessica; Dailidėnaitė, Milda; Robnik-Šikonja, Marko

(Submitted on 22 Nov 2019 (v1), last revised 27 Mar 2020 (this version, v2))

Abstract: In text processing, deep neural networks mostly use word embeddings as an input. Embeddings have to ensure that relations between words are reflected through distances in a high-dimensional numeric space. To compare the quality of different text embeddings, typically, we use benchmark datasets. We present a collection of such datasets for the word analogy task in nine languages: Croatian, English, Estonian, Finnish, Latvian, Lithuanian, Russian, Slovenian, and Swedish. We redesigned the original monolingual analogy task to be much more culturally independent and also constructed cross-lingual analogy datasets for the involved languages. We present basic statistics of the created datasets and their initial evaluation using fastText embeddings.
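To make the word analogy task concrete, here is a minimal sketch of the commonly used vector-offset formulation: given "a is to b as c is to ?", pick the vocabulary word whose vector is closest by cosine similarity to b - a + c. The toy three-dimensional vectors and the helper names `solve_analogy` and `analogy_accuracy` are assumptions made up for illustration. They are not fastText output and are not taken from the paper, whose exact evaluation protocol is not described in this abstract.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only (not fastText output).
TOY_EMBEDDINGS = {
    "man":   [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king":  [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.2, 0.1],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(a, b, c, embeddings):
    """Return the word d that best completes 'a is to b as c is to d'."""
    # Vector-offset target: b - a + c, computed component by component.
    target = [
        vb - va + vc
        for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    # The three query words are excluded from the candidate set.
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))


def analogy_accuracy(quadruples, embeddings):
    """Fraction of (a, b, c, d) quadruples where the predicted word equals d."""
    correct = sum(
        solve_analogy(a, b, c, embeddings) == d for a, b, c, d in quadruples
    )
    return correct / len(quadruples)
```

For example, `solve_analogy("man", "woman", "king", TOY_EMBEDDINGS)` selects "queen" with these toy vectors. A benchmark dataset of the kind described above would supply the quadruples, and real embeddings would replace the toy dictionary.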

Comments:	7 pages, LREC2020 conference
Subjects:	Computation and Language (cs.CL)
ACM classes:	J.5
Journal reference:	Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 4074-4080
Cite as:	arXiv:1911.10038 [cs.CL]
	(or arXiv:1911.10038v2 [cs.CL] for this version)

Submission history

From: Matej Ulčar
[v1] Fri, 22 Nov 2019 13:39:06 GMT (21kb)
[v2] Fri, 27 Mar 2020 15:32:16 GMT (26kb)
