We gratefully acknowledge support from
the Simons Foundation and member institutions.
Full-text links:

Download:

Current browse context:

cs.CL

Change to browse by:

cs

References & Citations

DBLP - CS Bibliography

Bookmark

(what is this?)
CiteULike logo BibSonomy logo Mendeley logo del.icio.us logo Digg logo Reddit logo

Computer Science > Computation and Language

Title: How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models

Abstract: In this work we provide a \textit{systematic empirical comparison} of pretrained multilingual language models versus their monolingual counterparts with regard to their monolingual task performance. We study a set of nine typologically diverse languages with readily available pretrained monolingual models on a set of five diverse monolingual downstream tasks. We first establish if a gap between the multilingual and the corresponding monolingual representation of that language exists, and subsequently investigate the reason for a performance difference. To disentangle the impacting variables, we train new monolingual models on the same data, but with different tokenizers, both the monolingual and the multilingual version. We find that while the pretraining data size is an important factor, the designated tokenizer of the monolingual model plays an equally important role in the downstream performance. Our results show that languages which are adequately represented in the multilingual model's vocabulary exhibit negligible performance decreases over their monolingual counterparts. We further find that replacing the original multilingual tokenizer with the specialized monolingual tokenizer improves the downstream performance of the multilingual model for almost every task and language.
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2012.15613 [cs.CL]
  (or arXiv:2012.15613v1 [cs.CL] for this version)

Submission history

From: Phillip Rust [view email]
[v1] Thu, 31 Dec 2020 14:11:00 GMT (9160kb,D)
[v2] Tue, 1 Jun 2021 18:17:59 GMT (8232kb,D)

Link back to: arXiv, form interface, contact.